AIML Capstone Project - Industrial safety¶

PROBLEM STATEMENT¶

• DOMAIN: Industrial safety. NLP based Chatbot.

• CONTEXT:

The database comes from one of the biggest industry in Brazil and in the world. It is an urgent need for industries/companies around the globe to understand why employees still suffer some injuries/accidents in plants. Sometimes they also die in such environment.

• DATA DESCRIPTION:

The database is basically records of accidents from 12 different plants in 03 different countries, every line in the data is an occurrence of an accident.

Columns description:¶

‣ Data: timestamp or time/date information

‣ Countries: which country the accident occurred (anonymised)

‣ Local: the city where the manufacturing plant is located (anonymised)

‣ Industry sector: which sector the plant belongs to

‣ Accident level: from I to VI, it registers how severe was the accident (I means not severe but VI means very severe)

‣ Potential Accident Level: Depending on the Accident Level, the database also registers how severe the accident could have been (due to other factors involved in the accident)

‣ Genre: if the person is male of female

‣ Employee or Third Party: if the injured person is an employee or a third party

‣ Critical Risk: some description of the risk involved in the accident

‣ Description: Detailed description of how the accident happened.

Link to download the dataset: https://www.kaggle.com/ihmstefanini/industrial-safety-and-health-analytics-database [for reference only]

Project Objective:¶

Design a ML/DL based chatbot utility which can help the professionals to highlight the safety risk as per the incident description.

Project Task¶

Milestone 1:¶

‣ Input: Context and Dataset

‣ Process:

‣ Step 1: Import the data

‣ Step 2: Data cleansing

‣ Step 3: Data preprocessing (NLP Preprocessing techniques)

‣ Step 4: Data preparation - Cleansed data in .xlsx or .csv file

‣ Step 5: Design train and test basic machine learning classifiers

‣ Step 6: Interim report

‣ Submission: Interim report, Jupyter Notebook with all the steps in Milestone-1

Abstract:¶

(Summary of problem statement(objective), data, findings, approach to EDA, Pre-processing and model building and selection)

Objective¶

This report aims to make predictions of the accident severity by predicting the target variable "Accident Level" or "Potential Accident Level" using machine learning (ML) models based on the historical accident description collected from some of the biggest industries in Brazil and in the world. The primary objective of this project is to build an NLP-based chatbot utility that can predict the severity of accidents based on incident descriptions, enabling proactive measures to enhance workplace safety.

Data:¶

Each line in the database is an occurrence of an accident in 12 different cities of 3 countries and across 3 industry types (Mining, Metals, Others). There are 5 accident levels (I to V) and 6 potential accident levels (I to VI). There are 3 types of employees: direct employee, third party employee and third party employee working remotely. Crticial Risk column seems to cateogrize the accident and description column has description about the accident occured.

The dataset comprise of mostly categorical data. The cleaned data columns which would be used to perform the analysis are as follows: Date, Country, City, Industry Sector, Accident Level, Potential Accident Level, Gender, Employee Type, Critical Risk, Description.

Findings and Approach:¶

  • Data Cleaning: The dataset was provided in Excel format and was imported into a Pandas DataFrame. Initial data cleaning involved renaming columns for clarity, dropping irrelevant and duplicate records, and standardizing the dataset.
  • Exploratory Data Analysis (EDA): Univariate analyses were conducted on all categorical columns to gain insights into all features, bivariate, and multivariate analyses were conducted to gain insights into feature relationships and interactions. Key findings were documented.
  • Feature Identification: The "Description" feature was identified as the most critical variable for classifying the dependent variables "Accident Level" or "Potential Accident Level."
  • NLP Pre-processing: The "Description" feature was pre-processed to generate a new feature, "Preprocessed_Description." Pre-processing steps included lowercasing, removing stopwords, punctuation, special characters, and numbers, along with tokenization and lemmatization.
  • Data Preparation: The cleaned and pre-processed data was exported as a CSV file and re-imported as a DataFrame for modeling. The TF-IDF vectorizer was applied to "Preprocessed_Description," extracting the top 1,000 features to be used as independent variables.
  • Modeling Setup: The dataset was split into training and testing sets in an 80:20 ratio using stratification on the dependent variable. We considered "Preprocessed_Description" as the independent variable and "Accident Level" or "Potential Accident Level" as the dependent variables.
  • Model Selection: Several machine learning algorithms were tested, including Naive Bayes, Logistic Regression, Support Vector Machines (SVM), Random Forest, and Gradient Boosting. Initial base models were trained to predict both "Accident Level" and "Potential Accident Level."
  • Target Variable Refinement: After analyzing model performance, "Accident Level" was chosen as the primary target variable due to consistently better results compared to "Potential Accident Level."
  • Handling Imbalance: The data exhibited a class imbalance, with approximately 74% of records falling under Accident Level I. To address this, the training data was balanced using the SMOTE technique.
  • Model Evaluation and Tuning: We evaluated three types of models for each classifier: base models trained on unbalanced data, base models trained on balanced data, and hyper-parameter-tuned models trained on balanced data. Performance metrics and relevance were considered to select the best-performing model.

The chosen model will be utilized to support real-time accident severity predictions, enabling proactive safety measures and more effective risk management within the organization.

Abstract:¶

(Summary of problem statement(objective), data, findings, approach to EDA, Pre-processing and model building and selection)

Objective¶

This report aims to make predictions of the accident severity by predicting the target variable "Accident Level" or "Potential Accident Level" using machine learning (ML) models based on the historical accident description collected from some of the biggest industries in Brazil and in the world. The primary objective of this project is to build an NLP-based chatbot utility that can predict the severity of accidents based on incident descriptions, enabling proactive measures to enhance workplace safety.

Data:¶

Each line in the database is an occurrence of an accident in 12 different cities of 3 countries and across 3 industry types (Mining, Metals, Others). There are 5 accident levels (I to V) and 6 potential accident levels (I to VI). There are 3 types of employees: direct employee, third party employee and third party employee working remotely. Crticial Risk column seems to cateogrize the accident and description column has description about the accident occured.

The dataset comprise of mostly categorical data. The cleaned data columns which would be used to perform the analysis are as follows: Date, Country, City, Industry Sector, Accident Level, Potential Accident Level, Gender, Employee Type, Critical Risk, Description.

Findings and Approach:¶

  • Data Cleaning: The dataset was provided in Excel format and was imported into a Pandas DataFrame. Initial data cleaning involved renaming columns for clarity, dropping irrelevant and duplicate records, and standardizing the dataset.
  • Exploratory Data Analysis (EDA): Univariate analyses were conducted on all categorical columns to gain insights into all features, bivariate, and multivariate analyses were conducted to gain insights into feature relationships and interactions. Key findings were documented.
  • Feature Identification: The "Description" feature was identified as the most critical variable for classifying the dependent variables "Accident Level" or "Potential Accident Level."
  • NLP Pre-processing: The "Description" feature was pre-processed to generate a new feature, "Preprocessed_Description." Pre-processing steps included lowercasing, removing stopwords, punctuation, special characters, and numbers, along with tokenization and lemmatization.
  • Data Preparation: The cleaned and pre-processed data was exported as a CSV file and re-imported as a DataFrame for modeling. The TF-IDF vectorizer was applied to "Preprocessed_Description," extracting the top 1,000 features to be used as independent variables.
  • Modeling Setup: The dataset was split into training and testing sets in an 80:20 ratio using stratification on the dependent variable. We considered "Preprocessed_Description" as the independent variable and "Accident Level" or "Potential Accident Level" as the dependent variables.
  • Model Selection: Several machine learning algorithms were tested, including Naive Bayes, Logistic Regression, Support Vector Machines (SVM), Random Forest, and Gradient Boosting. Initial base models were trained to predict both "Accident Level" and "Potential Accident Level."
  • Target Variable Refinement: After analyzing model performance, "Accident Level" was chosen as the primary target variable due to consistently better results compared to "Potential Accident Level."
  • Handling Imbalance: The data exhibited a class imbalance, with approximately 74% of records falling under Accident Level I. To address this, the training data was balanced using the SMOTE technique.
  • Model Evaluation and Tuning: We evaluated three types of models for each classifier: base models trained on unbalanced data, base models trained on balanced data, and hyper-parameter-tuned models trained on balanced data. Performance metrics and relevance were considered to select the best-performing model.

The chosen model will be utilized to support real-time accident severity predictions, enabling proactive safety measures and more effective risk management wigathin the

NOTE: We have attempted a total of 20 models. Total of 5 models were created for classifying “Potential Accident Level” and 15 models(3 different variation of model for each of the 5 classifiers) for classifying “Accident Level”.ornization.

In [ ]:
# # Mounting Google Drive
# from google.colab import drive
# drive.mount('/content/drive')
Import Basic Libraries¶
In [ ]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Step 1: Import Dataset¶

In [ ]:
# Step 1:
# Import the excel data as dataframe using pandas
data_original = pd.read_excel("Data+Set+-+industrial_safety_and_health_database_with_accidents_description.xlsx")
# data_original = pd.read_excel('/content/drive/MyDrive/Great Learning/Capstone/Data+Set+-+industrial_safety_and_health_database_with_accidents_description.xlsx')
data_original.head()
Out[ ]:
Unnamed: 0 Data Countries Local Industry Sector Accident Level Potential Accident Level Genre Employee or Third Party Critical Risk Description
0 0 2016-01-01 Country_01 Local_01 Mining I IV Male Third Party Pressed While removing the drill rod of the Jumbo 08 f...
1 1 2016-01-02 Country_02 Local_02 Mining I IV Male Employee Pressurized Systems During the activation of a sodium sulphide pum...
2 2 2016-01-06 Country_01 Local_03 Mining I III Male Third Party (Remote) Manual Tools In the sub-station MILPO located at level +170...
3 3 2016-01-08 Country_01 Local_04 Mining I I Male Third Party Others Being 9:45 am. approximately in the Nv. 1880 C...
4 4 2016-01-10 Country_01 Local_04 Mining IV IV Male Third Party Others Approximately at 11:45 a.m. in circumstances t...

Shape of data:

In [ ]:
print("There are {0} rows and {1} columns in the original Data Frame".format(data_original.shape[0], data_original.shape[1]))
There are 425 rows and 11 columns in the original Data Frame
In [ ]:
# Checking datatypes and null values in the dataset columns
data_original.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 425 entries, 0 to 424
Data columns (total 11 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   Unnamed: 0                425 non-null    int64         
 1   Data                      425 non-null    datetime64[ns]
 2   Countries                 425 non-null    object        
 3   Local                     425 non-null    object        
 4   Industry Sector           425 non-null    object        
 5   Accident Level            425 non-null    object        
 6   Potential Accident Level  425 non-null    object        
 7   Genre                     425 non-null    object        
 8   Employee or Third Party   425 non-null    object        
 9   Critical Risk             425 non-null    object        
 10  Description               425 non-null    object        
dtypes: datetime64[ns](1), int64(1), object(9)
memory usage: 36.7+ KB
In [ ]:
# checking for null values
data_original.isnull().sum()
Out[ ]:
Unnamed: 0                  0
Data                        0
Countries                   0
Local                       0
Industry Sector             0
Accident Level              0
Potential Accident Level    0
Genre                       0
Employee or Third Party     0
Critical Risk               0
Description                 0
dtype: int64
In [ ]:
# checking for uniques values in each column
data_original.nunique()
Out[ ]:
Unnamed: 0                  425
Data                        287
Countries                     3
Local                        12
Industry Sector               3
Accident Level                5
Potential Accident Level      6
Genre                         2
Employee or Third Party       3
Critical Risk                33
Description                 411
dtype: int64

Insights and Pre-Processing/Data Cleaning approach:

  • From the above output it is observed that "unnamed: 0" column has int64 datatype, "Data" column has datetime64 type and all other columns are string or object datatype.
  • "Unnamed" column can be removed as it does not seem to have any significance.
  • "Data" column is of datetime type and contains date, this column should be renamed as "Date"
  • The categorical columns such as "Countries", "Local", "Industry Sector", "Accident Level", "Potential Accident Level", "Genre", "Employee or Third Party", "Critical Risk" and "Description" are all relevant columns.
  • "Genre" column contains information about Gender, hence it can be renamed as "Gender"
  • "Employee or Third Party" column contains information about employee type hence it can be renamed as "Employee Type"
  • "Local" column contains information about anonymized city names, therefore, it can be renamed as "City"
  • "Countries" is a plural term, it can be renamed as a standard singular name like "Country"
  • There are no null values in the data set.

Step 2: Data cleaning¶

In [ ]:
# Removing unnamed column
df = data_original.drop("Unnamed: 0", axis=1)
df.head()
Out[ ]:
Data Countries Local Industry Sector Accident Level Potential Accident Level Genre Employee or Third Party Critical Risk Description
0 2016-01-01 Country_01 Local_01 Mining I IV Male Third Party Pressed While removing the drill rod of the Jumbo 08 f...
1 2016-01-02 Country_02 Local_02 Mining I IV Male Employee Pressurized Systems During the activation of a sodium sulphide pum...
2 2016-01-06 Country_01 Local_03 Mining I III Male Third Party (Remote) Manual Tools In the sub-station MILPO located at level +170...
3 2016-01-08 Country_01 Local_04 Mining I I Male Third Party Others Being 9:45 am. approximately in the Nv. 1880 C...
4 2016-01-10 Country_01 Local_04 Mining IV IV Male Third Party Others Approximately at 11:45 a.m. in circumstances t...
In [ ]:
# Renaming columns: 'Data', 'Countries', 'Local', 'Genre' and 'Employee or Third Party'
df.rename(columns={'Data':'Date', 'Countries':'Country', 'Local':'City', 'Genre':'Gender', 'Employee or Third Party':'Employee type'}, inplace=True)
df.head()
Out[ ]:
Date Country City Industry Sector Accident Level Potential Accident Level Gender Employee type Critical Risk Description
0 2016-01-01 Country_01 Local_01 Mining I IV Male Third Party Pressed While removing the drill rod of the Jumbo 08 f...
1 2016-01-02 Country_02 Local_02 Mining I IV Male Employee Pressurized Systems During the activation of a sodium sulphide pum...
2 2016-01-06 Country_01 Local_03 Mining I III Male Third Party (Remote) Manual Tools In the sub-station MILPO located at level +170...
3 2016-01-08 Country_01 Local_04 Mining I I Male Third Party Others Being 9:45 am. approximately in the Nv. 1880 C...
4 2016-01-10 Country_01 Local_04 Mining IV IV Male Third Party Others Approximately at 11:45 a.m. in circumstances t...
In [ ]:
# Checking the duplicate records
df.duplicated().sum()
Out[ ]:
7
  • There are 7 duplicated records
In [ ]:
# Viewing the duplicate records
df[df.duplicated()]
Out[ ]:
Date Country City Industry Sector Accident Level Potential Accident Level Gender Employee type Critical Risk Description
77 2016-04-01 Country_01 Local_01 Mining I V Male Third Party (Remote) Others In circumstances that two workers of the Abrat...
262 2016-12-01 Country_01 Local_03 Mining I IV Male Employee Others During the activity of chuteo of ore in hopper...
303 2017-01-21 Country_02 Local_02 Mining I I Male Third Party (Remote) Others Employees engaged in the removal of material f...
345 2017-03-02 Country_03 Local_10 Others I I Male Third Party Venomous Animals On 02/03/17 during the soil sampling in the re...
346 2017-03-02 Country_03 Local_10 Others I I Male Third Party Venomous Animals On 02/03/17 during the soil sampling in the re...
355 2017-03-15 Country_03 Local_10 Others I I Male Third Party Venomous Animals Team of the VMS Project performed soil collect...
397 2017-05-23 Country_01 Local_04 Mining I IV Male Third Party Projection of fragments In moments when the 02 collaborators carried o...
In [ ]:
# Removing the duplicate records
df.drop_duplicates(inplace=True)
df.shape
Out[ ]:
(418, 10)
  • After dropping 7 duplicate records from 425 records, 418 rows are left. After dropping 1 column i.e. "Unnamed: 0" only 10 columns are remaining in the dataset.
In [ ]:
# getting information on the data after some data cleaning activities
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 418 entries, 0 to 424
Data columns (total 10 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   Date                      418 non-null    datetime64[ns]
 1   Country                   418 non-null    object        
 2   City                      418 non-null    object        
 3   Industry Sector           418 non-null    object        
 4   Accident Level            418 non-null    object        
 5   Potential Accident Level  418 non-null    object        
 6   Gender                    418 non-null    object        
 7   Employee type             418 non-null    object        
 8   Critical Risk             418 non-null    object        
 9   Description               418 non-null    object        
dtypes: datetime64[ns](1), object(9)
memory usage: 35.9+ KB
In [ ]:
# getting more information about the cleaned data
df.describe(include = 'object')
Out[ ]:
Country City Industry Sector Accident Level Potential Accident Level Gender Employee type Critical Risk Description
count 418 418 418 418 418 418 418 418 418
unique 3 12 3 5 6 2 3 33 411
top Country_01 Local_03 Mining I IV Male Third Party Others During the activity of chuteo of ore in hopper...
freq 248 89 237 309 141 396 185 229 2

Insights:

  • There are 3 countries in the dataset and "Country_01" has more than 50% of the total industrial incidents.
  • There are 12 cities in the dataset and "Local_03" which belongs to "Country_01" has the most number of industrial incidents compared to other cities.
  • There are 3 categories of Industrial sector and "Mining" has the largest and more than 50% of the total number of incidents.
  • There are 5 unique accident levels and there are around 75% of accidents at accident level 1, which means the majority of the incidents are not severe.
  • There are 6 unique potential accident levels and the majority of the accidents have a potential accident level of 4.
  • Around 95% of the workers are males in all the mentioned accidents.
  • There are 3 types of employees. The majority of employees i.e. just under 50% of them are third-party employees.
  • Majority of the risk category under the "Critical Risk" column is mentioned as "Others".
  • There are 411 unique descriptions out of total 418 accident descriptions which means some accidents description exactly match each other.
  • To conclude, there is a high chance of the incidents occurrence for the following profile: country as country_01, Local_03 city, mining industrial sector, third-party employee, male worker. We can investigate further on this.

EDA - Univariate Analysis¶

In [ ]:
# Calculate value counts and percentages for the 'Category' column
value_counts = df['Country'].value_counts()
percentages = df['Country'].value_counts(normalize=True) * 100

# Plot the bar chart
plt.figure(figsize=(8, 6))
value_counts.plot(kind='bar', color='skyblue')
plt.title('Value Counts and Percentages for Country Column')
plt.ylabel('Count')
plt.xlabel('Country')

# Show the counts and percentages on top of the bars
for i, (count, pct) in enumerate(zip(value_counts, percentages)):
    plt.text(i, count + 1, f"{count} ({pct:.2f}%)", ha='center', fontweight='bold')

plt.show()
No description has been provided for this image

Insights:

  • 59.33% which is majority of the accidents happened in country_01
  • 30.86% of the accidents happened in country_02
  • 9.81% of the accidents happened in country_03
In [ ]:
# Calculate value counts and percentages for the 'City' column
value_counts = df['City'].value_counts()
percentages = df['City'].value_counts(normalize=True) * 100

# Plot the bar chart
plt.figure(figsize=(16, 6))
value_counts.plot(kind='bar', color='skyblue')
plt.title('Value Counts and Percentages for Country Column')
plt.ylabel('Count')
plt.xlabel('City')

# Show the counts and percentages on top of the bars
for i, (count, pct) in enumerate(zip(value_counts, percentages)):
    plt.text(i, count + 0.5, f"{count} ({pct:.2f}%)", ha='center', fontweight='bold')

plt.show()
No description has been provided for this image

Insights:

  • The highest number of incidents happened in "Local_03" city.
  • The lowest number of incidents happened in "Local_09" and "Local_11" city.
In [ ]:
# Calculate value counts and percentages for the 'Industry Sector' column
value_counts = df['Industry Sector'].value_counts()
percentages = df['Industry Sector'].value_counts(normalize=True) * 100

# Plot the bar chart
plt.figure(figsize=(10, 6))
value_counts.plot(kind='bar', color='skyblue')
plt.title('Value Counts and Percentages for Industry Sector Column')
plt.ylabel('Count')
plt.xlabel('Industry Sector')

# Show the counts and percentages on top of the bars
for i, (count, pct) in enumerate(zip(value_counts, percentages)):
    plt.text(i, count + 0.5, f"{count} ({pct:.2f}%)", ha='center', fontweight='bold')

plt.show()
No description has been provided for this image

Insights:

  • Around 57% of incidents happened in Mining industry sector.
  • Around 32% of incidents happened in Metals industry sector.
  • Around 11% of incidents happened in other industry sector.
In [ ]:
# Calculate value counts and percentages for the 'Accident Level' column
value_counts = df['Accident Level'].value_counts()
percentages = df['Accident Level'].value_counts(normalize=True) * 100

# Plot the bar chart
plt.figure(figsize=(10, 6))
value_counts.plot(kind='bar', color='skyblue')
plt.title('Value Counts and Percentages for Accident Level Column')
plt.ylabel('Count')
plt.xlabel('Accident Level')

# Show the counts and percentages on top of the bars
for i, (count, pct) in enumerate(zip(value_counts, percentages)):
    plt.text(i, count + 0.5, f"{count} ({pct:.2f}%)", ha='center', fontweight='bold')

plt.show()
No description has been provided for this image

Insights:

  • Majority of incidents have accident level 1.
  • As the accidents levels are increasing the number on accidents are decreasing, which means there are less extremely severe accidents and more less severe accidents.
In [ ]:
# Calculate value counts and percentages for the 'Potential Accident Level' column
value_counts = df['Potential Accident Level'].value_counts()
percentages = df['Potential Accident Level'].value_counts(normalize=True) * 100

# Plot the bar chart
plt.figure(figsize=(10, 6))
value_counts.plot(kind='bar', color='skyblue')
plt.title('Value Counts and Percentages for Potential Accident Level Column')
plt.ylabel('Count')
plt.xlabel('Potential Accident Level')

# Show the counts and percentages on top of the bars
for i, (count, pct) in enumerate(zip(value_counts, percentages)):
    plt.text(i, count + 0.5, f"{count} ({pct:.2f}%)", ha='center', fontweight='bold')

plt.show()
No description has been provided for this image

Insights:

  • There are 33.73% of accidents which has potential accident level as 4
  • The accident occurrences reduces as the potential accident level reduces from 4 to 1.
  • There are very few accident occurences with potential accident level 5 and 6.
In [ ]:
# Calculate value counts and percentages for the 'Gender' column
value_counts = df['Gender'].value_counts()
percentages = df['Gender'].value_counts(normalize=True) * 100

# Plot the bar chart
plt.figure(figsize=(10, 6))
value_counts.plot(kind='bar', color='skyblue')
plt.title('Value Counts and Percentages for Gender Column')
plt.ylabel('Count')
plt.xlabel('Gender')

# Show the counts and percentages on top of the bars
for i, (count, pct) in enumerate(zip(value_counts, percentages)):
    plt.text(i, count + 0.5, f"{count} ({pct:.2f}%)", ha='center', fontweight='bold')

plt.show()
No description has been provided for this image

Insights:

  • Around 95% of accidents involve male workers.
  • This also indicates that the majority of the workers are male.
  • The vast majority of accidents (396) involve male workers, which is striking when compared to the 22 accidents involving female workers. This substantial difference suggests that male employees may be engaging in more hazardous tasks or may have a higher exposure to risks in the workplace.
  • The low number of accidents involving female workers may indicate either lower participation in high-risk roles or effective safety practices in areas where women are employed.
In [ ]:
# Calculate value counts and percentages for the 'Employee type' column
value_counts = df['Employee type'].value_counts()
percentages = df['Employee type'].value_counts(normalize=True) * 100

# Plot the bar chart
plt.figure(figsize=(10, 6))
value_counts.plot(kind='bar', color='skyblue')
plt.title('Value Counts and Percentages for Employee type Column')
plt.ylabel('Count')
plt.xlabel('Employee type')

# Show the counts and percentages on top of the bars
for i, (count, pct) in enumerate(zip(value_counts, percentages)):
    plt.text(i, count + 0.5, f"{count} ({pct:.2f}%)", ha='center', fontweight='bold')

plt.show()
No description has been provided for this image

Insights:

  • There is almost an equal number of Third-party and direct employees who got involved in an accident in the industry. This also indicates that internal workers are nearly as vulnerable as third-party workers.
  • Third-party(Remote) employees are least involved in an industry accident might-be because they are working remotely and third-party employees are less in number. This also indicates that there are fewer operational risks or better safety management practices for remote workers compared to on-site third-party employees.
  • Third-party employees have the largest share of the employees getting impacted by the accidents, this may be due to lack of proper training and less experience in the working conditions of the current company.
In [ ]:
# Calculate value counts for the 'Critical Risk' column
value_counts = df['Critical Risk'].value_counts()

# Plot the bar chart
plt.figure(figsize=(15, 6))
value_counts.plot(kind='bar', color='skyblue')
plt.title('Value Counts for Critical Risk Column')
plt.ylabel('Count')
plt.xlabel('Critical Risk')

# Show the percentages on top of the bars
for i, v in enumerate(value_counts):
    plt.text(i, v + 0.5, f"{v:.0f}", size = 8, ha='center', fontweight='bold')

plt.show()
No description has been provided for this image

Insights:

  • The majority of the Critical risk is categorized under "Others" category.
  • Excluding the "Others" category, the majority of the critical risk in accidents is due to a body part getting "Pressed" while working followed by injuries by "Manual Tools" and exposure of "Chemical Substances".

Adding more useful columns to help with data analysis, EDA and explore possible patterns¶

  • Adding year, month, day, weekday, week of the year columns.
  • In the problem statement it was mentioned that the data is provided from biggest industries in Brazil and world. Therefore, assuming the seasons of the accidents based on brazil or southern hemisphere climatology as mentioned here: https://seasonsyear.com/Brazil
  • The get_season_brazil() function in the below code is based on Brazil's climatology:

Summer: December, January, February

Autumn: March, April, May

Winter: June, July, August

Spring: September, October, November

In [ ]:
df2 = df.copy()
In [ ]:
# Extract Year, Month, Day, Weekday, Week of the Year
df2['Date'] = pd.to_datetime(df2['Date'])
df2['Year'] = df2['Date'].dt.year
df2['Month'] = df2['Date'].dt.month
df2['Day'] = df2['Date'].dt.day
df2['Weekday'] = df2['Date'].dt.weekday  # Monday=0, Sunday=6
df2['Week of the Year'] = df2['Date'].dt.isocalendar().week  # ISO week of the year--------Need clarity to deal with 53rd week in january
## https://stackoverflow.com/questions/35749982/show-week-53-as-yearweek-in-python

# Function to determine the season based on Brazil's climatology
def get_season_brazil(month):
    if month in [12, 1, 2]:
        return 'Summer'
    elif month in [3, 4, 5]:
        return 'Autumn'
    elif month in [6, 7, 8]:
        return 'Winter'
    elif month in [9, 10, 11]:
        return 'Spring'

# Apply the function to create the 'Season' column based on Brazil's climatology
df2['Season'] = df2['Month'].apply(get_season_brazil)

# Display the updated DataFrame
df2.head()
Out[ ]:
Date Country City Industry Sector Accident Level Potential Accident Level Gender Employee type Critical Risk Description Year Month Day Weekday Week of the Year Season
0 2016-01-01 Country_01 Local_01 Mining I IV Male Third Party Pressed While removing the drill rod of the Jumbo 08 f... 2016 1 1 4 53 Summer
1 2016-01-02 Country_02 Local_02 Mining I IV Male Employee Pressurized Systems During the activation of a sodium sulphide pum... 2016 1 2 5 53 Summer
2 2016-01-06 Country_01 Local_03 Mining I III Male Third Party (Remote) Manual Tools In the sub-station MILPO located at level +170... 2016 1 6 2 1 Summer
3 2016-01-08 Country_01 Local_04 Mining I I Male Third Party Others Being 9:45 am. approximately in the Nv. 1880 C... 2016 1 8 4 1 Summer
4 2016-01-10 Country_01 Local_04 Mining IV IV Male Third Party Others Approximately at 11:45 a.m. in circumstances t... 2016 1 10 6 1 Summer
In [ ]:
df2.tail()
Out[ ]:
Date Country City Industry Sector Accident Level Potential Accident Level Gender Employee type Critical Risk Description Year Month Day Weekday Week of the Year Season
420 2017-07-04 Country_01 Local_04 Mining I III Male Third Party Others Being approximately 5:00 a.m. approximately, w... 2017 7 4 1 27 Winter
421 2017-07-04 Country_01 Local_03 Mining I II Female Employee Others The collaborator moved from the infrastructure... 2017 7 4 1 27 Winter
422 2017-07-05 Country_02 Local_09 Metals I II Male Employee Venomous Animals During the environmental monitoring activity i... 2017 7 5 2 27 Winter
423 2017-07-06 Country_02 Local_05 Metals I II Male Employee Cut The Employee performed the activity of strippi... 2017 7 6 3 27 Winter
424 2017-07-09 Country_01 Local_04 Mining I II Female Third Party Fall prevention (same level) At 10:00 a.m., when the assistant cleaned the ... 2017 7 9 6 27 Winter

Creating Trend charts of the number on incidents across day, week, month, year and seasons¶

In [ ]:
df3 = df2.copy()
In [ ]:
# Add a 'count' column to represent the occurrence of an incident
df3['count'] = 1  # Assuming each row represents one incident

# Set the 'Date' column as the DataFrame index
df3.set_index('Date', inplace=True)

# Resample data by day, week, month, and year to count incidents
daily_data = df3.resample('D').sum()  # Resampling by day
weekly_data = df3.resample('W').sum()  # Resampling by week
monthly_data = df3.resample('M').sum()  # Resampling by month
yearly_data = df3.resample('Y').sum()  # Resampling by year

# Plotting
plt.figure(figsize=(14, 10))

# Daily incidents line chart
plt.subplot(2, 2, 1)
plt.plot(daily_data.index, daily_data['count'], marker='o', linestyle='-')
plt.title('Incidents per Day')
plt.xlabel('Date')
plt.ylabel('Number of Incidents')

# Weekly incidents line chart
plt.subplot(2, 2, 2)
plt.plot(weekly_data.index, weekly_data['count'], marker='o', linestyle='-')
plt.title('Incidents per Week')
plt.xlabel('Date')
plt.ylabel('Number of Incidents')

# Monthly incidents line chart
plt.subplot(2, 2, 3)
plt.plot(monthly_data.index, monthly_data['count'], marker='o', linestyle='-')
plt.title('Incidents per Month')
plt.xlabel('Date')
plt.ylabel('Number of Incidents')

# Yearly incidents line chart
plt.subplot(2, 2, 4)
plt.plot(yearly_data.index, yearly_data['count'], marker='o', linestyle='-')
plt.title('Incidents per Year')
plt.xlabel('Date')
plt.ylabel('Number of Incidents')

# Adjust layout and show the plots
plt.tight_layout()
plt.show()
C:\Users\anime\AppData\Local\Temp\ipykernel_21540\4128195198.py:10: FutureWarning: 'M' is deprecated and will be removed in a future version, please use 'ME' instead.
  monthly_data = df3.resample('M').sum()  # Resampling by month
C:\Users\anime\AppData\Local\Temp\ipykernel_21540\4128195198.py:11: FutureWarning: 'Y' is deprecated and will be removed in a future version, please use 'YE' instead.
  yearly_data = df3.resample('Y').sum()  # Resampling by year
No description has been provided for this image
In [ ]:
# Creating the above charts as interactive charts to get more detailed analysis
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px
In [ ]:
df3 = df2.copy()
In [ ]:
# Convert the 'Date' column to datetime format
df3['Date'] = pd.to_datetime(df3['Date'])

# Add a 'count' column to represent the occurrence of an incident
df3['count'] = 1  # Assuming each row represents one incident

# Set the 'Date' column as the DataFrame index
df3.set_index('Date', inplace=True)

# Resample data by day, week, month, and year to count incidents
daily_data = df3.resample('D').sum()  # Resampling by day
weekly_data = df3.resample('W').sum()  # Resampling by week
monthly_data = df3.resample('M').sum()  # Resampling by month
yearly_data = df3.resample('Y').sum()  # Resampling by year

# Apply the function to create the 'Season' column based on Brazil's climatology
df3['Month'] = df3.index.month
df3['Season'] = df3['Month'].apply(get_season_brazil)

# Group by season to count incidents
season_data = df3.groupby('Season')['count'].sum().reindex(['Summer', 'Autumn', 'Winter', 'Spring'])

# Create individual charts for each time period

# Daily incidents line chart
fig_daily = go.Figure()
fig_daily.add_trace(go.Scatter(x=daily_data.index, y=daily_data['count'], mode='lines+markers', name='Daily'))
fig_daily.update_layout(
    title='Incidents per Day',
    xaxis_title='Date',
    yaxis_title='Number of Incidents',
    hovermode='x unified'
)
fig_daily.show()

# Weekly incidents line chart
fig_weekly = go.Figure()
fig_weekly.add_trace(go.Scatter(x=weekly_data.index, y=weekly_data['count'], mode='lines+markers', name='Weekly'))
fig_weekly.update_layout(
    title='Incidents per Week',
    xaxis_title='Date',
    yaxis_title='Number of Incidents',
    hovermode='x unified'
)
fig_weekly.show()

# Monthly incidents line chart
fig_monthly = go.Figure()
fig_monthly.add_trace(go.Scatter(x=monthly_data.index, y=monthly_data['count'], mode='lines+markers', name='Monthly'))
fig_monthly.update_layout(
    title='Incidents per Month',
    xaxis_title='Date',
    yaxis_title='Number of Incidents',
    hovermode='x unified'
)
fig_monthly.show()

# Yearly incidents line chart
fig_yearly = go.Figure()
fig_yearly.add_trace(go.Scatter(x=yearly_data.index, y=yearly_data['count'], mode='lines+markers', name='Yearly'))
fig_yearly.update_layout(
    title='Incidents per Year',
    xaxis_title='Date',
    yaxis_title='Number of Incidents',
    hovermode='x unified'
)
fig_yearly.show()
C:\Users\anime\AppData\Local\Temp\ipykernel_21540\174771619.py:13: FutureWarning: 'M' is deprecated and will be removed in a future version, please use 'ME' instead.
  monthly_data = df3.resample('M').sum()  # Resampling by month
C:\Users\anime\AppData\Local\Temp\ipykernel_21540\174771619.py:14: FutureWarning: 'Y' is deprecated and will be removed in a future version, please use 'YE' instead.
  yearly_data = df3.resample('Y').sum()  # Resampling by year

Insights:

  • There is no clear trend observed at daily and weekly level, however, there is noticeable trend in incidents per month.
  • The number of incidents per months has a slight declining trend as the year progresses from February to December.
  • There is an increase in number of incidents observed from January to February for both the years.
  • We have a full year data for 2016 and half year data for 2017, therefore there is a overall declining trend in number of incidents shown in the year level trend chart.

More insights on Monthly accident counts trend chart:

  • The data spans from January 2016 to July 2017, covering about 1.5 years
  • The number of accidents fluctuates significantly from month to month
  • Highest Count: The month with the highest accident count is March 2016 with 34 accidents. This suggests a period of increased incidents, which could be due to various factors such as operational changes, seasonal effects, or specific events affecting safety
  • Lowest Count: The lowest accident count occurs in July 2017, with only 5 accidents. This drop might indicate improvements in safety measures or operational changes leading to fewer incidents
In [ ]:
df3 = df2.copy()
In [ ]:
# Creating trend chart on number of incidents as per the seasons.
# Convert the 'Date' column to datetime format
df3['Date'] = pd.to_datetime(df3['Date'])

# Add a 'count' column to represent the occurrence of an incident
df3['count'] = 1  # Assuming each row represents one incident

# Set the 'Date' column as the DataFrame index
df3.set_index('Date', inplace=True)

# Function to determine the season based on Brazil's climatology
def get_season_brazil(month):
    if month in [12, 1, 2]:
        return 'Summer'
    elif month in [3, 4, 5]:
        return 'Autumn'
    elif month in [6, 7, 8]:
        return 'Winter'
    elif month in [9, 10, 11]:
        return 'Spring'

# Extract Year and Month, and create a 'Season' column
df3['Year'] = df3.index.year
df3['Month'] = df3.index.month
df3['Season'] = df3['Month'].apply(get_season_brazil)

# Group by both Year and Season, summing the incident counts
season_data_by_year = df3.groupby(['Year', 'Season'])['count'].sum().reset_index()

# Ensure that the Season order is preserved as Summer, Autumn, Winter, Spring
season_order = ['Summer', 'Autumn', 'Winter', 'Spring']
season_data_by_year['Season'] = pd.Categorical(season_data_by_year['Season'], categories=season_order, ordered=True)

# Create a bar chart showing incidents by season and year
fig_season_year = px.bar(
    season_data_by_year,
    x='Season',
    y='count',
    color='Season',
    barmode='group',
    facet_col='Year',  # This will create a separate plot for each year
    title='Incidents by Season and Year (Brazil Climatology)',
    text='count',
    color_discrete_map={
        'Summer': 'orange', 'Autumn': 'brown', 'Winter': 'blue', 'Spring': 'green'
    }
)

# Update the layout of the seasonal chart
fig_season_year.update_layout(
    xaxis_title='Season',
    yaxis_title='Number of Incidents',
    hovermode='x'
)

# Show the updated seasonal chart
fig_season_year.show()

Insights:

  • Autumn and Winter seasons seems to be having more number of incidents compared to spring and summer season.
  • It indicates there are more incidents in cooler weather compared to a warmer weather.
In [ ]:
# Reviewing the data
df3.head(2)
Out[ ]:
Country City Industry Sector Accident Level Potential Accident Level Gender Employee type Critical Risk Description Year Month Day Weekday Week of the Year Season count
Date
2016-01-01 Country_01 Local_01 Mining I IV Male Third Party Pressed While removing the drill rod of the Jumbo 08 f... 2016 1 1 4 53 Summer 1
2016-01-02 Country_02 Local_02 Mining I IV Male Employee Pressurized Systems During the activation of a sodium sulphide pum... 2016 1 2 5 53 Summer 1
In [ ]:
# Regaining the index back
df3.reset_index(inplace = True)
df3.head()
Out[ ]:
Date Country City Industry Sector Accident Level Potential Accident Level Gender Employee type Critical Risk Description Year Month Day Weekday Week of the Year Season count
0 2016-01-01 Country_01 Local_01 Mining I IV Male Third Party Pressed While removing the drill rod of the Jumbo 08 f... 2016 1 1 4 53 Summer 1
1 2016-01-02 Country_02 Local_02 Mining I IV Male Employee Pressurized Systems During the activation of a sodium sulphide pum... 2016 1 2 5 53 Summer 1
2 2016-01-06 Country_01 Local_03 Mining I III Male Third Party (Remote) Manual Tools In the sub-station MILPO located at level +170... 2016 1 6 2 1 Summer 1
3 2016-01-08 Country_01 Local_04 Mining I I Male Third Party Others Being 9:45 am. approximately in the Nv. 1880 C... 2016 1 8 4 1 Summer 1
4 2016-01-10 Country_01 Local_04 Mining IV IV Male Third Party Others Approximately at 11:45 a.m. in circumstances t... 2016 1 10 6 1 Summer 1

EDA using bi-variate and multi-variate analysis:¶

Accident level by country

In [ ]:
# Accident level by country
plt.figure(figsize=(10, 6))
sns.countplot(x='Accident Level', hue='Country', data=df3)
plt.title('Accident Level by Country')
plt.show()
No description has been provided for this image

Insights

  • Country_01 reports the highest number of Level I accidents compared to other countries, showing a high volume of less severe incidents

  • Country_02 also shows a high number of Level I accidents but fewer than Country_01

  • For more severe levels (II to V), the distribution is more balanced among the countries, suggesting that while minor accidents are common in Country_01, more severe incidents are spread evenly across all three countries

Bivariate Analysis: Accident level by critical risk

In [ ]:
# Accident level by critical risk

# Optional: Filter the dataset to show only the top N critical risks (if applicable)
top_critical_risks = df3['Critical Risk'].value_counts().nlargest(10).index
filtered_df = df3[df3['Critical Risk'].isin(top_critical_risks)]

# Accident level by critical risk with adjusted figure size and bar width
plt.figure(figsize=(16, 10))  # Increased figure size
sns.countplot(x='Accident Level', hue='Critical Risk', data=filtered_df, palette='Set2', width=0.8)
plt.title('Accident Level Distribution by Critical Risk (Top 10)', fontsize=18)
plt.xlabel('Accident Level', fontsize=16)
plt.ylabel('Count', fontsize=16)
plt.xticks(rotation=45, fontsize=14)  # Adjusted rotation and font size
plt.legend(title='Critical Risk', bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=12)
plt.tight_layout()  # Adjust layout to fit everything
plt.show()
No description has been provided for this image

Insights:

  • Accident Level I dominates, particularly in the "Others" category, with over 160 occurrences, which is significantly higher than all other risks

  • Minor Representation: Other risk categories such as "Pressed," "Manual Tools," and "Vehicles and Mobile Equipment" have a scattered distribution, mostly clustered in the lower accident levels

  • Other Accident Levels: Levels III and IV show some concentration in the "Others" risk category, but their occurrence is far less compared to Level I

  • Uncommon Risks: Categories like "Venomous Animals" and "Bees" appear infrequently across all accident levels

Bivariate Analysis - Target Variable - Accident Level and Potential Accident Level

In [ ]:
# Accident Severity Analysis (Accident Level vs Potential Accident Level): This is to analyze if the accidents are mostly minor ( Level 1-II) or
# are severe accidents ( V-VI)

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd # Import pandas for data manipulation

# Create a contingency table
contingency_table = pd.crosstab(df3['Accident Level'], df3['Potential Accident Level'])

# Reset index to convert the multi-index to columns
contingency_table = contingency_table.reset_index()

# Melt the table to create a 'Count' column for the heatmap
comparison = contingency_table.melt(id_vars=['Accident Level'], var_name='Potential Accident Level', value_name='Count')


# Pivot the data for a heatmap
# Use 'comparison' which now contains the necessary data
comparison_pivot = comparison.pivot(index="Accident Level", columns="Potential Accident Level", values="Count")

# Plot the heatmap
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
sns.heatmap(comparison_pivot, annot=True, cmap="YlGnBu", fmt='g')
plt.title('Actual Accident Level vs. Potential Accident Level')
plt.xlabel('Potential Accident Level')
plt.ylabel('Accident Level')
plt.show()

# Violin Plot to show the distribution
plt.figure(figsize=(10, 6))
sns.violinplot(data=df3, x='Accident Level', y='Potential Accident Level')
plt.title('Violin Plot of Potential Accident Level by Actual Accident Level')
plt.xlabel('Actual Accident Level')
plt.ylabel('Potential Accident Level')
plt.xticks(rotation=45)
plt.show()
No description has been provided for this image
No description has been provided for this image

Insights:

Distribution of Accidents:

  • The table above shows a range of accidents categorized by their actual severity (Accident Level) and potential severity (Potential Accident Level)
  • Accident Level I has the highest occurrence, especially with potential severity levels up to IV. This indicates that while many accidents are classified as minor (I), they have the potential to escalate into more serious incidents if not properly managed

Potential Severity:

  • There are 88 incidents classified as I but with a potential severity of II, and 89 incidents with potential severity of III. This suggests that while the incidents were not severe at the time, there is a risk of them leading to more severe outcomes if conditions change or safety measures are not followed

Multi-variate Analysis: Industry Sectors, Countries and Accident levels:

Derive which countries or industry sectors have higher accident rates and which ones frequently encounter severe accidents

In [ ]:
# 1. High-Risk Sectors or Regions:
# Insight to derive: Which countries or industry sectors have higher accident rates and
# which ones frequently encounter severe accidents.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Group data by Countries and Industry Sector
country_sector_group = df3.groupby(['Country', 'Industry Sector']).size().reset_index(name='Accident Count')

# Display grouped data
print(country_sector_group.head())
      Country Industry Sector  Accident Count
0  Country_01          Metals              46
1  Country_01          Mining             200
2  Country_01          Others               2
3  Country_02          Metals              88
4  Country_02          Mining              37
In [ ]:
# Visualize Accident Count by Countries and Industry Sectors

# Pivot the data for a heatmap
pivot_data = country_sector_group.pivot(index="Country", columns="Industry Sector", values="Accident Count") # Changed to keyword arguments

# Plot a heatmap of accident counts across countries and industry sectors
plt.figure(figsize=(10,6))
sns.heatmap(pivot_data, annot=True, cmap="Blues")
plt.title('Accident Count by Countries and Industry Sector')
plt.show()

# Analyze Accident Severity Distribution

# Group data by Countries, Industry sector, and Accident level
severity_group = df3.groupby(['Country', 'Industry Sector', 'Accident Level']).size().reset_index(name='Count')

# Pivot the data for plotting
severity_pivot = severity_group.pivot_table(index=['Country', 'Industry Sector'], columns='Accident Level', values='Count', fill_value=0)

# Plot stacked bar chart for accident severity
severity_pivot.plot(kind='bar', stacked=True, figsize=(12, 7), colormap='coolwarm')
plt.title('Accident Severity Distribution by Countries and Industry Sector')
plt.ylabel('Number of Accidents')
plt.show()

# Identify High-Risk Plants to analyze specific plants where the accident severity is consistently higher

# Group data by Countries, Local (plant), and Accident level
plant_risk_group = df3.groupby(['Country', 'City', 'Accident Level']).size().reset_index(name='Count')

# Find plants with high accident severity (Accident level >= IV)
high_risk_plants = plant_risk_group[plant_risk_group['Accident Level'] >= 'IV']

# # Display high-risk plants
# print(high_risk_plants)
No description has been provided for this image
No description has been provided for this image

Insights:

Mining Sector: The Mining industry has the highest number of accidents in country_01, with a total of 237 accidents (200 from Country_01 and 37 from Country_02). This suggests that mining operations, especially in Country_01, face significant safety challenges.

Metals Sector: The Metals industry also has the highest number of accidents in country_02, with a combined total of 134 accidents for country_01 and country_02. Interestingly, Country_02 has almost double the accident count (88) compared to Country_01 (46), indicating that Country_02's metals industry may require more attention to safety practices.

Others Sector: There are 47 accidents in the Others category, and the majority of them occurred in Country_03. This indicates that this sector has more operations in country_03 and requires more attention to safety practices in country_03.

  • In all the industry sectors across all the countries most of the accidents are categorized as 1.
  • Accident level 5 are found in only metal and mining industry sector of country_01. This indicates that more severe accidents are most likely to happen in country_01 across metal and mining industries.

Analyze the Description Column - To know what is the most common word

In [ ]:
import pandas as pd
from collections import Counter
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string
import nltk

# Make sure to download NLTK resources if not already installed
nltk.download('punkt')
nltk.download('stopwords')

# Load the data (assuming the dataset is already loaded as 'data')
# Extract the 'Description' column
descriptions = df3['Description'].dropna()

# Preprocess the text
def preprocess_text(text):
    # Tokenize text
    tokens = word_tokenize(text.lower())

    # Remove punctuation
    tokens = [word for word in tokens if word.isalpha()]

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]

    return tokens

# Apply preprocessing to all descriptions
all_words = []
for description in descriptions:
    tokens = preprocess_text(description)
    all_words.extend(tokens)

# Get the frequency of each word
word_freq = Counter(all_words)

# Convert the word frequencies to a DataFrame for better readability
word_freq_df = pd.DataFrame(word_freq.most_common(), columns=['Word', 'Frequency'])

# Display the top 20 most frequent words
print(word_freq_df.head(20))
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\anime\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\anime\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
            Word  Frequency
0        causing        166
1           hand        163
2       employee        156
3           left        155
4          right        154
5       operator        126
6         injury        104
7           time        101
8       activity         91
9           area         80
10        moment         78
11     equipment         76
12          work         76
13      accident         73
14  collaborator         71
15         level         70
16     assistant         68
17        finger         68
18        worker         67
19          pipe         67
In [ ]:
!pip install wordcloud
Requirement already satisfied: wordcloud in c:\users\anime\anaconda3\lib\site-packages (1.9.3)
Requirement already satisfied: numpy>=1.6.1 in c:\users\anime\anaconda3\lib\site-packages (from wordcloud) (1.26.4)
Requirement already satisfied: pillow in c:\users\anime\anaconda3\lib\site-packages (from wordcloud) (10.3.0)
Requirement already satisfied: matplotlib in c:\users\anime\appdata\roaming\python\python311\site-packages (from wordcloud) (3.7.3)
Requirement already satisfied: contourpy>=1.0.1 in c:\users\anime\anaconda3\lib\site-packages (from matplotlib->wordcloud) (1.2.0)
Requirement already satisfied: cycler>=0.10 in c:\users\anime\anaconda3\lib\site-packages (from matplotlib->wordcloud) (0.11.0)
Requirement already satisfied: fonttools>=4.22.0 in c:\users\anime\anaconda3\lib\site-packages (from matplotlib->wordcloud) (4.51.0)
Requirement already satisfied: kiwisolver>=1.0.1 in c:\users\anime\anaconda3\lib\site-packages (from matplotlib->wordcloud) (1.4.4)
Requirement already satisfied: packaging>=20.0 in c:\users\anime\anaconda3\lib\site-packages (from matplotlib->wordcloud) (23.2)
Requirement already satisfied: pyparsing>=2.3.1 in c:\users\anime\anaconda3\lib\site-packages (from matplotlib->wordcloud) (3.0.9)
Requirement already satisfied: python-dateutil>=2.7 in c:\users\anime\anaconda3\lib\site-packages (from matplotlib->wordcloud) (2.8.2)
Requirement already satisfied: six>=1.5 in c:\users\anime\anaconda3\lib\site-packages (from python-dateutil>=2.7->matplotlib->wordcloud) (1.16.0)
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\pycodestyle-2.10.0.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\pycosat-0.6.4.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\pyflakes-3.0.1.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\pylint_venv-2.3.0.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\pyodbc-4.0.34.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\pyrsistent-0.18.0.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\pytz-2023.3.post1.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\pywin32_ctypes-0.2.0.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\PyYAML-6.0.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\pyzmq-23.2.0.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\queuelib-1.5.0-py3.11.egg-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\regex-2022.7.9.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\safetensors-0.3.2.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\pycodestyle-2.10.0.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\pycosat-0.6.4.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\pyflakes-3.0.1.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\pylint_venv-2.3.0.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\pyodbc-4.0.34.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\pyrsistent-0.18.0.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\pytz-2023.3.post1.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\pywin32_ctypes-0.2.0.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\PyYAML-6.0.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\pyzmq-23.2.0.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\queuelib-1.5.0-py3.11.egg-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\regex-2022.7.9.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\safetensors-0.3.2.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\sniffio-1.2.0.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\soupsieve-2.4.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\tabulate-0.8.10.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\tokenizers-0.13.2.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\tornado-6.3.2.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\typing_extensions-4.7.1.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\widgetsnbextension-4.0.5.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\xxhash-2.0.2.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\y_py-0.5.9.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\zipp-3.11.0.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\pycodestyle-2.10.0.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\pycosat-0.6.4.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\pyflakes-3.0.1.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\pylint_venv-2.3.0.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\pyodbc-4.0.34.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\pyrsistent-0.18.0.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\pytz-2023.3.post1.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\pywin32_ctypes-0.2.0.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\PyYAML-6.0.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\pyzmq-23.2.0.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\queuelib-1.5.0-py3.11.egg-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\regex-2022.7.9.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\safetensors-0.3.2.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\sniffio-1.2.0.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\soupsieve-2.4.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\tabulate-0.8.10.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\tokenizers-0.13.2.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\tornado-6.3.2.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\typing_extensions-4.7.1.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\widgetsnbextension-4.0.5.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\xxhash-2.0.2.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\y_py-0.5.9.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\zipp-3.11.0.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\pycodestyle-2.10.0.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\pycosat-0.6.4.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\pyflakes-3.0.1.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\pylint_venv-2.3.0.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\pyodbc-4.0.34.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\pyrsistent-0.18.0.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\pytz-2023.3.post1.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\pywin32_ctypes-0.2.0.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\PyYAML-6.0.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\pyzmq-23.2.0.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\queuelib-1.5.0-py3.11.egg-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\regex-2022.7.9.dist-info due to invalid metadata entry 'name'
WARNING: Skipping C:\Users\anime\anaconda3\Lib\site-packages\safetensors-0.3.2.dist-info due to invalid metadata entry 'name'
In [ ]:
from wordcloud import WordCloud

# Combine all descriptions into one text
all_descriptions = " ".join(df3['Description'].dropna())

# Create the word cloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(all_descriptions)

# Display the word cloud
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
No description has been provided for this image

Insights:

Body Parts Involved in Injuries:

  • Words like "hand" (177), "left" (155), "right" (154), and "finger" (76) indicate that injuries frequently involve hands and fingers. This suggests that many accidents occur during manual tasks or handling of equipment, leading to injuries in these body parts

Human Factors:

  • "Employee" (172), "operator" (132), "worker" (84), and "assistant" (75) are common, indicating that accidents frequently involve people working directly on machinery or in high-risk environments
  • The presence of "collaborator" (81) suggests accidents often involve teamwork or multiple individuals being present, which may require enhanced safety protocols during collaborative task

Causes of Accidents:

  • "Causing" (166) and "activity" (117) highlight that incidents are typically linked to specific tasks or actions performed by workers. This could point to potential procedural errors or unsafe work practices during certain activities
  • Words like "equipment" (76) and "pipe" (71) show that machinery and industrial equipment are common factors in these accidents, suggesting the need for better equipment maintenance, training, or handling practices

Time-Related Elements:

  • Words like "time" (112) and "moment" (101) indicate that the timing or specific moments during tasks are often highlighted in accident reports. This might suggest that accidents occur due to rushed actions, critical moments, or lapses in concentration during certain time-sensitive tasks

Work Environment:

  • The word "area" (80) suggests that the location or environment where the accident occurred plays a role in these incidents. It could indicate that certain workspaces or zones are more prone to accidents and might require additional safety measures
  • "Level" (70) could refer to specific floors, heights, or operational levels in the workplace, pointing to possible risks related to vertical tasks, elevated work areas, or levels of equipment operation

Accident Severity and Nature:

  • "Injury" (110) and "accident" (73) confirm that the descriptions focus heavily on the consequences of the incidents, with emphasis on the physical harm caused

EDA Summary¶

Country

Country_01 has the highest incidents (59.33%), followed by Country_02 (30.86%) and Country_03 (9.81%).

City

Local_03 accounts for the largest share of incidents (21.29%), while other cities like Local_05 and Local_01 also show notable counts. Local_12, Local_09 and Local_11 cities has the least number of accidents.

Industry Sector

Mining dominates with 56.7% of incidents, while Metals accounts for 32.06% and Others for 11.24%.

Accident Level Most incidents have accident level I with 73.92% of incidents, followed by level II,III, IV and V. As the accidents levels are increasing the number on accidents are decreasing.

Potential Accident Level

Most incidents are in the highest potential accident level IV with 33.73% of incidents, followed by level III at 25.36%.

Gender

Males are overwhelmingly involved in incidents, representing 94.74% of cases.

Employee Type

Third parties represent 44.26% of incidents, closely followed by employees (42.58%), and third-party remote workers (13.16%).

Critical Risk

The 'Others' category dominates, followed by smaller contributions from risks like "Pressed," "Manual Tools," and "Chemical Substances."

Incidents per Day/Week/Month/Year/Season

The data spans from January 2016 to July 2017, covering about 1.5 years. Incidents fluctuate over time, peaking at certain points in months and years, with steady trends over weeks. The number of incidents per month has a slight declining trend as the year progresses from February to December. More incidents pccured in cooler weather compared to a warmer weather.

Accident Level by Country

Country_01 shows the highest accident level I incidents, followed by lower counts in other levels for all countries.

Accident Level by Critical Risk

Level I incidents dominate across all critical risks, with a few cases in Levels II-V across various risk types.

Accident Level and Potential Accident Level Accident Level I has the highest occurrence, especially with potential severity levels up to IV. This indicates that while many accidents are classified as minor (I), they have the potential to escalate into more serious incidents if not properly managed

Industry Sectors, Countries and accident levels: The Mining industry has the highest number of accidents in country_01. The Metals industry has the highest number of accidents in country_02. Majority of Others category sector accident occurred in Country_03. In all the industry sectors across all the countries most of the accidents are categorized as 1.

Word Cloud: Words like "hand" (177), "left" (155), "right" (154), and "finger" (76) indicate that injuries frequently involve hands and fingers. This suggests that many accidents occur during manual tasks or handling of equipment, leading to injuries in these body parts.

Step 3: NLP Pre-processing¶

In [ ]:
df3.head()
Out[ ]:
Date Country City Industry Sector Accident Level Potential Accident Level Gender Employee type Critical Risk Description Year Month Day Weekday Week of the Year Season count
0 2016-01-01 Country_01 Local_01 Mining I IV Male Third Party Pressed While removing the drill rod of the Jumbo 08 f... 2016 1 1 4 53 Summer 1
1 2016-01-02 Country_02 Local_02 Mining I IV Male Employee Pressurized Systems During the activation of a sodium sulphide pum... 2016 1 2 5 53 Summer 1
2 2016-01-06 Country_01 Local_03 Mining I III Male Third Party (Remote) Manual Tools In the sub-station MILPO located at level +170... 2016 1 6 2 1 Summer 1
3 2016-01-08 Country_01 Local_04 Mining I I Male Third Party Others Being 9:45 am. approximately in the Nv. 1880 C... 2016 1 8 4 1 Summer 1
4 2016-01-10 Country_01 Local_04 Mining IV IV Male Third Party Others Approximately at 11:45 a.m. in circumstances t... 2016 1 10 6 1 Summer 1
In [ ]:
# Displaying the complete text of the first five descriptions in the dataset
df.head()['Description'].values
Out[ ]:
array(['While removing the drill rod of the Jumbo 08 for maintenance, the supervisor proceeds to loosen the support of the intermediate centralizer to facilitate the removal, seeing this the mechanic supports one end on the drill of the equipment to pull with both hands the bar and accelerate the removal from this, at this moment the bar slides from its point of support and tightens the fingers of the mechanic between the drilling bar and the beam of the jumbo.',
       'During the activation of a sodium sulphide pump, the piping was uncoupled and the sulfide solution was designed in the area to reach the maid. Immediately she made use of the emergency shower and was directed to the ambulatory doctor and later to the hospital. Note: of sulphide solution = 48 grams / liter.',
       'In the sub-station MILPO located at level +170 when the collaborator was doing the excavation work with a pick (hand tool), hitting a rock with the flat part of the beak, it bounces off hitting the steel tip of the safety shoe and then the metatarsal area of \u200b\u200bthe left foot of the collaborator causing the injury.',
       'Being 9:45 am. approximately in the Nv. 1880 CX-695 OB7, the personnel begins the task of unlocking the Soquet bolts of the BHB machine, when they were in the penultimate bolt they identified that the hexagonal head was worn, proceeding Mr. Cristóbal - Auxiliary assistant to climb to the platform to exert pressure with your hand on the "DADO" key, to prevent it from coming out of the bolt; in those moments two collaborators rotate with the lever in anti-clockwise direction, leaving the key of the bolt, hitting the palm of the left hand, causing the injury.',
       'Approximately at 11:45 a.m. in circumstances that the mechanics Anthony (group leader), Eduardo and Eric Fernández-injured-the three of the Company IMPROMEC, performed the removal of the pulley of the motor of the pump 3015 in the ZAF of Marcy. 27 cm / Length: 33 cm / Weight: 70 kg), as it was locked proceed to heating the pulley to loosen it, it comes out and falls from a distance of 1.06 meters high and hits the instep of the right foot of the worker, causing the injury described.'],
      dtype=object)
In [ ]:
# Displaying the complete text of the last five descriptions in the dataset
df.tail()['Description'].values
Out[ ]:
array(['Being approximately 5:00 a.m. approximately, when lifting the Kelly HQ towards the pulley of the frame to align it, the assistant Marco that is in the later one is struck the hand against the frame generating the injury.',
       'The collaborator moved from the infrastructure office (Julio to the toilets, when the pin of the right shoe is hooked on the bra of the left shoe causing not to take the step and fall untimely, causing injury described.',
       'During the environmental monitoring activity in the area, the employee was surprised by a swarming swarm of weevils. During the exit of the place, endured suffering two stings, being one in the face and the other in the middle finger of the left hand.',
       'The Employee performed the activity of stripping cathodes, when pulling the cathode sheet his hand hit the side of another cathode, causing a blunt cut on his 2nd finger of the left hand.',
       'At 10:00 a.m., when the assistant cleaned the floor of module "E" in the central camp, she slipped back and immediately grabbed the laundry table to avoid falling to the floor; suffering the described injury.'],
      dtype=object)

Insights and NLP Pre-processing techniques to be performed:

  • Lowercasing: There are capital letters, we can covert all the text to lower case.
  • Puncuation Removal: All the unnecessary punctuations can be removed.
  • Stopwords Removal: There are stop words like "the", "of", "to" etc. We can remove them.
  • Removing Special Characters: We can clean up all non-alphanumeric characters (except for words).
  • Tokenization: Break sentences into individual words (tokens).
  • Lemmatization: We can reduce a word to its root by using lemmatization. We would prefer to do lemmatization over stemming to get the actual word and not just the trucated part of the word. Lemmatization also gives more context to chatbot conversations as it recognizes words based on their exact and contextual meaning.
  • Removing Numbers: Remove any numeric values if not relevant.
In [ ]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Download necessary resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\anime\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\anime\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\anime\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
Out[ ]:
True
In [ ]:
# Initialize stopwords and lemmatizer
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_description(text):
    """
    Preprocess the text by performing NLP pre-processing steps.
    """
    # 1. Lowercasing
    text = text.lower()

    # 2. Remove punctuation and special characters
    text = re.sub(r'[^\w\s]', '', text)

    # 3. Tokenize the text
    tokens = word_tokenize(text)

    # 4. Remove stopwords
    tokens = [word for word in tokens if word not in stop_words]

    # 5. Lemmatize the tokens (optional: use stemming instead if preferred)
    tokens = [lemmatizer.lemmatize(word) for word in tokens]

    # 6. Remove numbers
    tokens = [word for word in tokens if not word.isdigit()]

    # Join tokens back into a string
    return ' '.join(tokens)

# Apply pre-processing to the "Description" column
df3['Preprocessed_Description'] = df3['Description'].apply(preprocess_description)

# Display the preprocessed descriptions
df3[['Description', 'Preprocessed_Description']].head()
Out[ ]:
Description Preprocessed_Description
0 While removing the drill rod of the Jumbo 08 f... removing drill rod jumbo maintenance superviso...
1 During the activation of a sodium sulphide pum... activation sodium sulphide pump piping uncoupl...
2 In the sub-station MILPO located at level +170... substation milpo located level collaborator ex...
3 Being 9:45 am. approximately in the Nv. 1880 C... approximately nv cx695 ob7 personnel begin tas...
4 Approximately at 11:45 a.m. in circumstances t... approximately circumstance mechanic anthony gr...
In [ ]:
df3.head()[['Description','Preprocessed_Description']].values
Out[ ]:
array([['While removing the drill rod of the Jumbo 08 for maintenance, the supervisor proceeds to loosen the support of the intermediate centralizer to facilitate the removal, seeing this the mechanic supports one end on the drill of the equipment to pull with both hands the bar and accelerate the removal from this, at this moment the bar slides from its point of support and tightens the fingers of the mechanic between the drilling bar and the beam of the jumbo.',
        'removing drill rod jumbo maintenance supervisor proceeds loosen support intermediate centralizer facilitate removal seeing mechanic support one end drill equipment pull hand bar accelerate removal moment bar slide point support tightens finger mechanic drilling bar beam jumbo'],
       ['During the activation of a sodium sulphide pump, the piping was uncoupled and the sulfide solution was designed in the area to reach the maid. Immediately she made use of the emergency shower and was directed to the ambulatory doctor and later to the hospital. Note: of sulphide solution = 48 grams / liter.',
        'activation sodium sulphide pump piping uncoupled sulfide solution designed area reach maid immediately made use emergency shower directed ambulatory doctor later hospital note sulphide solution gram liter'],
       ['In the sub-station MILPO located at level +170 when the collaborator was doing the excavation work with a pick (hand tool), hitting a rock with the flat part of the beak, it bounces off hitting the steel tip of the safety shoe and then the metatarsal area of \u200b\u200bthe left foot of the collaborator causing the injury.',
        'substation milpo located level collaborator excavation work pick hand tool hitting rock flat part beak bounce hitting steel tip safety shoe metatarsal area left foot collaborator causing injury'],
       ['Being 9:45 am. approximately in the Nv. 1880 CX-695 OB7, the personnel begins the task of unlocking the Soquet bolts of the BHB machine, when they were in the penultimate bolt they identified that the hexagonal head was worn, proceeding Mr. Cristóbal - Auxiliary assistant to climb to the platform to exert pressure with your hand on the "DADO" key, to prevent it from coming out of the bolt; in those moments two collaborators rotate with the lever in anti-clockwise direction, leaving the key of the bolt, hitting the palm of the left hand, causing the injury.',
        'approximately nv cx695 ob7 personnel begin task unlocking soquet bolt bhb machine penultimate bolt identified hexagonal head worn proceeding mr cristóbal auxiliary assistant climb platform exert pressure hand dado key prevent coming bolt moment two collaborator rotate lever anticlockwise direction leaving key bolt hitting palm left hand causing injury'],
       ['Approximately at 11:45 a.m. in circumstances that the mechanics Anthony (group leader), Eduardo and Eric Fernández-injured-the three of the Company IMPROMEC, performed the removal of the pulley of the motor of the pump 3015 in the ZAF of Marcy. 27 cm / Length: 33 cm / Weight: 70 kg), as it was locked proceed to heating the pulley to loosen it, it comes out and falls from a distance of 1.06 meters high and hits the instep of the right foot of the worker, causing the injury described.',
        'approximately circumstance mechanic anthony group leader eduardo eric fernándezinjuredthe three company impromec performed removal pulley motor pump zaf marcy cm length cm weight kg locked proceed heating pulley loosen come fall distance meter high hit instep right foot worker causing injury described']],
      dtype=object)

Step 4: Data preparation - Cleansed data in .xlsx or .csv file¶

In [ ]:
# Save the cleaned dataset as a .csv or .xlsx file
df3.to_csv('cleaned_industrial_safety_data.csv', index=False)

Step 5: Design train and test basic machine learning classifiers¶

In [ ]:
# Reading back the cleaned csv data created in previous steps as Dataframe
df4 = pd.read_csv('cleaned_industrial_safety_data.csv')
df4.head()
Out[ ]:
Date Country City Industry Sector Accident Level Potential Accident Level Gender Employee type Critical Risk Description Year Month Day Weekday Week of the Year Season count Preprocessed_Description
0 2016-01-01 Country_01 Local_01 Mining I IV Male Third Party Pressed While removing the drill rod of the Jumbo 08 f... 2016 1 1 4 53 Summer 1 removing drill rod jumbo maintenance superviso...
1 2016-01-02 Country_02 Local_02 Mining I IV Male Employee Pressurized Systems During the activation of a sodium sulphide pum... 2016 1 2 5 53 Summer 1 activation sodium sulphide pump piping uncoupl...
2 2016-01-06 Country_01 Local_03 Mining I III Male Third Party (Remote) Manual Tools In the sub-station MILPO located at level +170... 2016 1 6 2 1 Summer 1 substation milpo located level collaborator ex...
3 2016-01-08 Country_01 Local_04 Mining I I Male Third Party Others Being 9:45 am. approximately in the Nv. 1880 C... 2016 1 8 4 1 Summer 1 approximately nv cx695 ob7 personnel begin tas...
4 2016-01-10 Country_01 Local_04 Mining IV IV Male Third Party Others Approximately at 11:45 a.m. in circumstances t... 2016 1 10 6 1 Summer 1 approximately circumstance mechanic anthony gr...

Deciding Suitable Models¶

We will use the below model to classify accident levels based on incident description:

  • Naive Bayes
  • Logistic Regression
  • SVM
  • Random Forest
  • Gradient Boosting

For multi-class classification of accident severity based on accident descriptions using TF-IDF text features, it's wise to start with models like Naive Bayes, Logistic Regression, SVM, Random Forest, and GradientBoostingClassifier. Here’s why each model is a good choice:

  1. Naive Bayes:

We can choose it because it is computationally efficient, simple, and well-suited for high-dimensional and sparse data like TF-IDF features. It performs well in text classification tasks due to its probabilistic approach.

Use Case Fit: Handles multi-class problems effectively with robust performance even when the independence assumption of features isn’t strictly met.

  1. Logistic Regression:

We can choose it because It’s a straightforward and interpretable model that works well with TF-IDF features. Regularization techniques can help prevent overfitting, making it reliable.

Use Case Fit: Logistic regression is linear, handles multi-class scenarios with ease (via One-vs-Rest), and is good for understanding feature importance.

  1. Support Vector Machine (SVM):

We can choose it because SVMs are powerful classifiers that can create complex decision boundaries and work well for text data. They handle high-dimensional spaces effectively and can be tuned with different kernels.

Use Case Fit: Ideal for handling non-linear relationships and distinguishing subtle differences in text data, even with sparse features.

  1. Random Forest:

We can choose it because it's an ensemble method that reduces overfitting by averaging multiple decision trees. It handles complex feature interactions and works well even if some data is missing or noisy.

Use Case Fit: Good for capturing non-linearities and interactions among features, especially when the feature set size grows after creating TF-IDF vectors.

  1. GradientBoostingClassifier:

We can choose it because Gradient boosting is an ensemble method known for its high predictive power. It iteratively improves predictions by combining weak learners, reducing bias, and handling non-linear patterns.

Use Case Fit: Great for fine-tuning and achieving high accuracy on imbalanced and complex datasets when you have enough data to train it efficiently.

  • Summary:

These models offer a mix of simplicity, interpretability, and power. Starting with Naive Bayes will gives us a strong baseline due to its efficiency, while models like Logistic Regression, SVM, Random Forest, and GradientBoosting allow us to explore more complex relationships and interactions in the text data. These models provide a range of trade-offs between computation cost, interpretability, and predictive power, making them a solid starting point for classifying accident severity levels based on text descriptions.

Importing libraries for performing classification using ML models:¶
In [ ]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

Classification using base models¶

Classification of "Accident Level" target variable

In [ ]:
# Load and pre-process data (assuming pre-processing has been done as per previous code)
# We will use 'Preprocessed_Description' for features and 'Accident Level' as the target.

# Splitting the dataset into features and target
X = df4['Preprocessed_Description']  # Features (preprocessed descriptions)
y = df4['Accident Level']  # Target (accident levels)

# 1. TF-IDF Vectorization of the pre-processed descriptions
tfidf_vectorizer = TfidfVectorizer(max_features=1000)  # Limit to top 1000 features for simplicity
X_tfidf = tfidf_vectorizer.fit_transform(X)

# 2. Split the data into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42, stratify=y)

# 3. Train various classification models
models = {
    "Naive_Bayes": MultinomialNB(),
    "Logistic_Regression": LogisticRegression(random_state=1),
    "SVM": SVC(random_state=1),
    "Random_Forest": RandomForestClassifier(random_state=1),
    "Gradient_Boosting": GradientBoostingClassifier(random_state=1)
}

# Initialize an empty list to store classification metrics
metrics_list = []

for model_name, model in models.items():
    # Train the model
    model.fit(X_train, y_train)

    # Make predictions on the test data
    y_pred = model.predict(X_test)

    # Get the classification report as a string
    report_str = classification_report(y_test, y_pred, zero_division=0)

    # Get the classification report as a dictionary
    report_dict = classification_report(y_test, y_pred, output_dict=True, zero_division=0)

    # Extract accuracy, precision, recall, and F1-score (average metrics for each model)
    accuracy = accuracy_score(y_test, y_pred)
    precision = report_dict['weighted avg']['precision']
    recall = report_dict['weighted avg']['recall']
    f1_score = report_dict['weighted avg']['f1-score']

    # Append the metrics to the list
    metrics_list.append({
        'Model': model_name,
        'Accuracy': accuracy,
        'Precision': precision,
        'Recall': recall,
        'F1-Score': f1_score
    })

    # Print the classification report for the current model
    print(f"Classification Report for {model_name}:")
    print(report_str)
    print("=" * 60)  # Divider for clarity

# Convert the list of metrics into a DataFrame for comparison
metrics_comparison = pd.DataFrame(metrics_list)

# Display the comparison of metrics for each model
print("Comparison of Metrics:")
print(metrics_comparison)

# Optional: Save the comparison to a CSV for later use
# metrics_comparison.to_csv('/mnt/data/model_metrics_comparison.csv', index=False)
Classification Report for Naive_Bayes:
              precision    recall  f1-score   support

           I       0.74      1.00      0.85        62
          II       0.00      0.00      0.00         8
         III       0.00      0.00      0.00         6
          IV       0.00      0.00      0.00         6
           V       0.00      0.00      0.00         2

    accuracy                           0.74        84
   macro avg       0.15      0.20      0.17        84
weighted avg       0.54      0.74      0.63        84

============================================================
Classification Report for Logistic_Regression:
              precision    recall  f1-score   support

           I       0.74      1.00      0.85        62
          II       0.00      0.00      0.00         8
         III       0.00      0.00      0.00         6
          IV       0.00      0.00      0.00         6
           V       0.00      0.00      0.00         2

    accuracy                           0.74        84
   macro avg       0.15      0.20      0.17        84
weighted avg       0.54      0.74      0.63        84

============================================================
Classification Report for SVM:
              precision    recall  f1-score   support

           I       0.74      1.00      0.85        62
          II       0.00      0.00      0.00         8
         III       0.00      0.00      0.00         6
          IV       0.00      0.00      0.00         6
           V       0.00      0.00      0.00         2

    accuracy                           0.74        84
   macro avg       0.15      0.20      0.17        84
weighted avg       0.54      0.74      0.63        84

============================================================
Classification Report for Random_Forest:
              precision    recall  f1-score   support

           I       0.76      1.00      0.86        62
          II       0.50      0.12      0.20         8
         III       0.00      0.00      0.00         6
          IV       0.00      0.00      0.00         6
           V       0.00      0.00      0.00         2

    accuracy                           0.75        84
   macro avg       0.25      0.23      0.21        84
weighted avg       0.61      0.75      0.65        84

============================================================
Classification Report for Gradient_Boosting:
              precision    recall  f1-score   support

           I       0.75      0.92      0.83        62
          II       0.00      0.00      0.00         8
         III       0.00      0.00      0.00         6
          IV       0.25      0.17      0.20         6
           V       0.00      0.00      0.00         2

    accuracy                           0.69        84
   macro avg       0.20      0.22      0.21        84
weighted avg       0.57      0.69      0.62        84

============================================================
Comparison of Metrics:
                 Model  Accuracy  Precision    Recall  F1-Score
0          Naive_Bayes  0.738095   0.544785  0.738095  0.626875
1  Logistic_Regression  0.738095   0.544785  0.738095  0.626875
2                  SVM  0.738095   0.544785  0.738095  0.626875
3        Random_Forest  0.750000   0.605691  0.750000  0.654630
4    Gradient_Boosting  0.690476   0.571429  0.690476  0.624017

Insight:

  • All the above classification models gave almost the similar results, we can try to further improve the model performance by balancing the data and tuning the model

Classification of "Potential Accident Level" target variable

In [ ]:
# Load and pre-process data (assuming pre-processing has been done as per previous code)
# We will use 'Preprocessed_Description' for features and 'Accident Level' as the target.

# Splitting the dataset into features and target
X = df4['Preprocessed_Description']  # Features (preprocessed descriptions)
y = df4['Potential Accident Level']  # Target (Potential accident levels)

# 1. TF-IDF Vectorization of the pre-processed descriptions
tfidf_vectorizer = TfidfVectorizer(max_features=1000)  # Limit to top 1000 features for simplicity
X_tfidf = tfidf_vectorizer.fit_transform(X)

# 2. Split the data into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42) #, stratify=y)

# 3. Train various classification models
models = {
    "Naive_Bayes": MultinomialNB(),
    "Logistic_Regression": LogisticRegression(random_state=1),
    "SVM": SVC(random_state=1),
    "Random_Forest": RandomForestClassifier(random_state=1),
    "Gradient_Boosting": GradientBoostingClassifier(random_state=1)
}

# Initialize an empty list to store classification metrics
metrics_list = []

for model_name, model in models.items():
    # Train the model
    model.fit(X_train, y_train)

    # Make predictions on the test data
    y_pred = model.predict(X_test)

    # Get the classification report as a string
    report_str = classification_report(y_test, y_pred, zero_division=0)

    # Get the classification report as a dictionary
    report_dict = classification_report(y_test, y_pred, output_dict=True, zero_division=0)

    # Extract accuracy, precision, recall, and F1-score (average metrics for each model)
    accuracy = accuracy_score(y_test, y_pred)
    precision = report_dict['weighted avg']['precision']
    recall = report_dict['weighted avg']['recall']
    f1_score = report_dict['weighted avg']['f1-score']

    # Append the metrics to the list
    metrics_list.append({
        'Model': model_name,
        'Accuracy': accuracy,
        'Precision': precision,
        'Recall': recall,
        'F1-Score': f1_score
    })

    # Print the classification report for the current model
    print(f"Classification Report for {model_name}:")
    print(report_str)
    print("=" * 60)  # Divider for clarity

# Convert the list of metrics into a DataFrame for comparison
metrics_comparison = pd.DataFrame(metrics_list)

# Display the comparison of metrics for each model
print("Comparison of Metrics:")
print(metrics_comparison)

# Optional: Save the comparison to a CSV for later use
# metrics_comparison.to_csv('/mnt/data/model_metrics_comparison.csv', index=False)
Classification Report for Naive_Bayes:
              precision    recall  f1-score   support

           I       1.00      0.33      0.50         9
          II       0.67      0.35      0.46        17
         III       0.40      0.07      0.12        27
          IV       0.36      0.92      0.52        26
           V       0.00      0.00      0.00         5

    accuracy                           0.42        84
   macro avg       0.48      0.34      0.32        84
weighted avg       0.48      0.42      0.35        84

============================================================
Classification Report for Logistic_Regression:
              precision    recall  f1-score   support

           I       1.00      0.33      0.50         9
          II       0.67      0.47      0.55        17
         III       0.20      0.04      0.06        27
          IV       0.36      0.88      0.51        26
           V       0.00      0.00      0.00         5

    accuracy                           0.42        84
   macro avg       0.45      0.35      0.33        84
weighted avg       0.42      0.42      0.34        84

============================================================
Classification Report for SVM:
              precision    recall  f1-score   support

           I       1.00      0.22      0.36         9
          II       1.00      0.18      0.30        17
         III       0.67      0.07      0.13        27
          IV       0.34      1.00      0.51        26
           V       0.00      0.00      0.00         5

    accuracy                           0.39        84
   macro avg       0.60      0.29      0.26        84
weighted avg       0.63      0.39      0.30        84

============================================================
Classification Report for Random_Forest:
              precision    recall  f1-score   support

           I       1.00      0.33      0.50         9
          II       0.50      0.24      0.32        17
         III       0.45      0.19      0.26        27
          IV       0.39      0.92      0.55        26
           V       0.00      0.00      0.00         5

    accuracy                           0.43        84
   macro avg       0.47      0.34      0.33        84
weighted avg       0.47      0.43      0.37        84

============================================================
Classification Report for Gradient_Boosting:
              precision    recall  f1-score   support

           I       1.00      0.33      0.50         9
          II       0.62      0.47      0.53        17
         III       0.50      0.22      0.31        27
          IV       0.40      0.77      0.53        26
           V       0.20      0.20      0.20         5
          VI       0.00      0.00      0.00         0

    accuracy                           0.45        84
   macro avg       0.45      0.33      0.34        84
weighted avg       0.53      0.45      0.44        84

============================================================
Comparison of Metrics:
                 Model  Accuracy  Precision    Recall  F1-Score
0          Naive_Bayes  0.416667   0.481509  0.416667  0.346911
1  Logistic_Regression  0.416667   0.417584  0.416667  0.343520
2                  SVM  0.392857   0.629699  0.392857  0.300329
3        Random_Forest  0.428571   0.474253  0.428571  0.371751
4    Gradient_Boosting  0.452381   0.528114  0.452381  0.435221

Insights:

  • All the above classification models gave poor results when "Potential Accident Level" when it is used as the target variable instead of "Accident Level", we can finalize "Accident Level" as the target variable for doing the classification.
  • Let's improve the imbalanced data and run all the base and tuned classification model on the balanced data

Model performance Improvement by Balancing data using SMOTE and Hyper-parameter Tuning¶

  • We will use SMOTE technique to balance the training data.
  • The top N most relevant features based on term frequency or TF-IDF score will be selected by using max_features after creating the tfidf vectorization.
  • Model performance improvement will be attempted using hyper-parameter tuning.
In [ ]:
from imblearn.over_sampling import SMOTE
In [ ]:
# Load and pre-process data (assuming pre-processing has been done as per previous code)
# We will use 'Preprocessed_Description' for features and 'Accident Level' as the target.

# Splitting the dataset into features and target
X = df4['Preprocessed_Description']  # Features (preprocessed descriptions)
y = df4['Accident Level']  # Target (accident levels)

# 1. TF-IDF Vectorization of the pre-processed descriptions
tfidf_vectorizer = TfidfVectorizer(max_features=1000)  # Limit to top 1000 features for simplicity
X_tfidf = tfidf_vectorizer.fit_transform(X)

# 2. Split the data into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42, stratify=y)

# 3. Apply SMOTE to the training data
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
In [ ]:
# checking if the training data is balanced
y_train_smote.value_counts()
Out[ ]:
Accident Level
I      247
II     247
III    247
IV     247
V      247
Name: count, dtype: int64
  • It is observed that the training data is balanced after using the SMOTE technique

For the following part of the code we will build 3 kinds of model for each classifier. Suppose name of the model is "Model Name" then the below naming convention is followed:

  1. "Model_Name": This is the base model of the classifier which is trained on imbalanced cleaned and pre-processed training data.
  2. "Model_Name_smote": This is the base model of the classifier which is trained on balanced training data obtained after SMOTE
  3. "Model_Name_tuned": This is the Hyper-parameter tuned model of the classifier which is trained on balanced training data obtained after SMOTE
1. Naive Bayes¶
In [ ]:
# Naive Base Model on original training data
Naive_Bayes = MultinomialNB()
Naive_Bayes.fit(X_train, y_train)
# Checking train accuracy
model_score_train = Naive_Bayes.score(X_train, y_train)
print('Train Accuracy:', model_score_train)
# Checking test accuracy
model_score_test = Naive_Bayes.score(X_test, y_test)
print('Test Accuracy:', model_score_test)
Train Accuracy: 0.7395209580838323
Test Accuracy: 0.7380952380952381
  • It seems the base Naive Bayes model trained on imbalanced data is not overfitting because there is no significant difference between the train and test accuracy
In [ ]:
from sklearn import metrics
In [ ]:
#predict on test
y_predict = Naive_Bayes.predict(X_test)
# Performance metrics for base model on test data
print(metrics.classification_report(y_test, y_predict))
              precision    recall  f1-score   support

           I       0.74      1.00      0.85        62
          II       0.00      0.00      0.00         8
         III       0.00      0.00      0.00         6
          IV       0.00      0.00      0.00         6
           V       0.00      0.00      0.00         2

    accuracy                           0.74        84
   macro avg       0.15      0.20      0.17        84
weighted avg       0.54      0.74      0.63        84

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1509: UndefinedMetricWarning:

Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1509: UndefinedMetricWarning:

Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1509: UndefinedMetricWarning:

Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

In [ ]:
# Naive bayes Base Model on balanced training data
Naive_Bayes_smote = MultinomialNB()
Naive_Bayes_smote.fit(X_train_smote, y_train_smote)
# Checking train accuracy
model_score_train = Naive_Bayes_smote.score(X_train_smote, y_train_smote)
print('Train Accuracy:', model_score_train)
# Checking test accuracy
model_score_test = Naive_Bayes_smote.score(X_test, y_test)
print('Test Accuracy:', model_score_test)
Train Accuracy: 0.9603238866396762
Test Accuracy: 0.6309523809523809
  • It seems the base Naive Bayes model trained on balanced data is overfitting because there is a significant difference between the train and test accuracy
In [ ]:
#predict on test
y_predict = Naive_Bayes_smote.predict(X_test)
# Performance metrics for base model on test data
print(metrics.classification_report(y_test, y_predict))
              precision    recall  f1-score   support

           I       0.86      0.68      0.76        62
          II       0.22      0.50      0.31         8
         III       0.44      0.67      0.53         6
          IV       0.40      0.33      0.36         6
           V       0.33      0.50      0.40         2

    accuracy                           0.63        84
   macro avg       0.45      0.54      0.47        84
weighted avg       0.72      0.63      0.66        84

  • Use cross validation techniques and apply hyper-parameter tuning techniques to get the best performance of the model.
In [ ]:
from sklearn.model_selection import GridSearchCV
In [ ]:
# grid search
# Choose the type of classifier.
Naive_Bayes_tuned = MultinomialNB()

# Grid of parameters to choose from
parameters = {'alpha': [0.1, 0.5, 1.0, 2.0],  # Smoothing parameter
    'fit_prior': [True, False]
             }

# Type of scoring used to compare parameter combinations
# acc_scorer = metrics.make_scorer(metrics.recall_score)


# Run the grid search
grid_obj = GridSearchCV(Naive_Bayes_tuned, parameters,cv=5) # , scoring=acc_scorer - # by default the scoring metrics would be the accuracy
# Used 5-fold cross validation
grid_obj = grid_obj.fit(X_train_smote, y_train_smote)

# Set the model to the best combination of parameters
Naive_Bayes_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
Naive_Bayes_tuned.fit(X_train_smote, y_train_smote)
Out[ ]:
MultinomialNB(alpha=0.1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
MultinomialNB(alpha=0.1)
In [ ]:
# checking best parameters
grid_obj.best_params_
Out[ ]:
{'alpha': 0.1, 'fit_prior': True}
In [ ]:
# Checking train accuracy
model_score_train = Naive_Bayes_tuned.score(X_train_smote, y_train_smote)
print('Train Accuracy:', model_score_train)
# Checking test accuracy
model_score_test = Naive_Bayes_tuned.score(X_test, y_test)
print('Test Accuracy:', model_score_test)
Train Accuracy: 0.9902834008097166
Test Accuracy: 0.7023809523809523
  • It seems the tuned Naive Bayes model is overfitting because there is a significant difference between the train and test accuracy
In [ ]:
#predict on test
y_predict = Naive_Bayes_tuned.predict(X_test)
# Performance metrics for base model on test data
print(metrics.classification_report(y_test, y_predict))
              precision    recall  f1-score   support

           I       0.77      0.90      0.83        62
          II       0.29      0.25      0.27         8
         III       0.00      0.00      0.00         6
          IV       0.50      0.17      0.25         6
           V       0.00      0.00      0.00         2

    accuracy                           0.70        84
   macro avg       0.31      0.26      0.27        84
weighted avg       0.63      0.70      0.66        84

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1509: UndefinedMetricWarning:

Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1509: UndefinedMetricWarning:

Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1509: UndefinedMetricWarning:

Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

In [ ]:
from sklearn import metrics

# Function to calculate different metric scores(Accuracy, Recall, Precision, F1-score) of the model trained on balanced training data
def get_metrics_score(model, flag=True):
    '''
    model : classifier to predict values of X
    x_train_smote : Balanced training feature set
    x_test : Test feature set
    y_train_smote : Balanced training target labels
    y_test : Test target labels
    flag : If True, prints the classification report, accuracy, precision, recall, f1-score

    Returns:
    A list of train and test accuracy, precision, recall, F1-score (weighted average) metrics.
    '''
    # defining an empty list to store train and test results
    score_list = []

    # Predicting on train and test sets. We will use the balanced data
    pred_train = model.predict(X_train_smote)
    pred_test = model.predict(X_test)

    # Accuracy of the model
    train_acc = metrics.accuracy_score(y_train_smote, pred_train)
    test_acc = metrics.accuracy_score(y_test, pred_test)

    # Classification report for weighted average precision, recall, f1-score
    classification_report_train = metrics.classification_report(y_train_smote, pred_train, output_dict=True, zero_division=0)
    classification_report_test = metrics.classification_report(y_test, pred_test, output_dict=True, zero_division=0)

    # Weighted averages

    train_precision = classification_report_train['weighted avg']['precision']
    test_precision = classification_report_test['weighted avg']['precision']

    train_recall = classification_report_train['weighted avg']['recall']
    test_recall = classification_report_test['weighted avg']['recall']

    train_f1_score = classification_report_train['weighted avg']['f1-score']
    test_f1_score = classification_report_test['weighted avg']['f1-score']

    # Append all metrics (train/test accuracy, precision, recall, f1-score)
    score_list.extend([train_acc, test_acc, train_precision, test_precision, train_recall, test_recall, train_f1_score, test_f1_score])

    # If the flag is set to True, print out the classification reports and the metrics
    if flag:
        print("Classification Report (Train):\n", metrics.classification_report(y_train, pred_train, zero_division=0))
        print("Classification Report (Test):\n", metrics.classification_report(y_test, pred_test, zero_division=0))
        print("\nMetrics Summary:")
        print(f"Accuracy on Training set: {train_acc}")
        print(f"Accuracy on Test set: {test_acc}")
        print(f"Precision on Training set (Weighted Avg): {train_precision}")
        print(f"Precision on Test set (Weighted Avg): {test_precision}")
        print(f"Recall on Training set (Weighted Avg): {train_recall}")
        print(f"Recall on Test set (Weighted Avg): {test_recall}")
        print(f"F1-Score on Training set (Weighted Avg): {train_f1_score}")
        print(f"F1-Score on Test set (Weighted Avg): {test_f1_score}")

    return score_list  # returning the list with train and test scores
In [ ]:
# defining list of models for model comparison
models = [Naive_Bayes_smote, Naive_Bayes_tuned]

# defining empty lists to add train and test results
acc_train = []
acc_test = []
recall_train = []
recall_test = []
precision_train = []
precision_test = []
f1_score_train = []
f1_score_test = []

# looping through all the models to get the accuracy, recall, precision and f-1 scores
for model in models:
    j = get_metrics_score(model,False)
    acc_train.append(np.round(j[0],2))
    acc_test.append(np.round(j[1],2))
    precision_train.append(np.round(j[2],2))
    precision_test.append(np.round(j[3],2))
    recall_train.append(np.round(j[4],2))
    recall_test.append(np.round(j[5],2))
    f1_score_train.append(np.round(j[6],2))
    f1_score_test.append(np.round(j[7],2))

comparison_frame = pd.DataFrame({'Model':['Naive_Bayes_smote', 'Naive_Bayes_tuned'],
                                          'Train_Accuracy': acc_train,'Test_Accuracy': acc_test,
                                          'Train_Recall':recall_train,'Test_Recall':recall_test,
                                          'Train_Precision':precision_train,'Test_Precision':precision_test,
                                          'Train_f1_score':f1_score_train,'Test_f1_score':f1_score_test
                                })
comparison_frame
Out[ ]:
Model Train_Accuracy Test_Accuracy Train_Recall Test_Recall Train_Precision Test_Precision Train_f1_score Test_f1_score
0 Naive_Bayes_smote 0.96 0.63 0.96 0.63 0.96 0.72 0.96 0.66
1 Naive_Bayes_tuned 0.99 0.70 0.99 0.70 0.99 0.63 0.99 0.66
  • There seems to be overfitting in the above models as there is a significant (more than 25%) difference in the train and test metrics.
2. Logistic regression¶
In [ ]:
# Logistic Regression Base Model on original training data
Logistic_Regression = LogisticRegression()
Logistic_Regression.fit(X_train, y_train)
# Checking train accuracy
model_score_train = Logistic_Regression.score(X_train, y_train)
print('Train Accuracy:', model_score_train)
# Checking test accuracy
model_score_test = Logistic_Regression.score(X_test, y_test)
print('Test Accuracy:', model_score_test)
Train Accuracy: 0.7395209580838323
Test Accuracy: 0.7380952380952381
  • It seems the base Logistic Regression model trained on imbalanced data is not overfitting because there is no significant difference between the train and test accuracy.
In [ ]:
#predict on test
y_predict = Logistic_Regression.predict(X_test)
# Performance metrics for base model on test data
print(metrics.classification_report(y_test, y_predict))
              precision    recall  f1-score   support

           I       0.74      1.00      0.85        62
          II       0.00      0.00      0.00         8
         III       0.00      0.00      0.00         6
          IV       0.00      0.00      0.00         6
           V       0.00      0.00      0.00         2

    accuracy                           0.74        84
   macro avg       0.15      0.20      0.17        84
weighted avg       0.54      0.74      0.63        84

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1509: UndefinedMetricWarning:

Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1509: UndefinedMetricWarning:

Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1509: UndefinedMetricWarning:

Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

In [ ]:
# Logistic Regression Base Model on balanced training data
Logistic_Regression_smote = LogisticRegression()
Logistic_Regression_smote.fit(X_train_smote, y_train_smote)
# Checking train accuracy
model_score_train = Logistic_Regression_smote.score(X_train_smote, y_train_smote)
print('Train Accuracy:', model_score_train)
# Checking test accuracy
model_score_test = Logistic_Regression_smote.score(X_test, y_test)
print('Test Accuracy:', model_score_test)
Train Accuracy: 0.994331983805668
Test Accuracy: 0.7023809523809523
  • It seems the base Logistic Regression model trained on balanced training data is overfitting because there is a significant difference between the train and test accuracy.
In [ ]:
#predict on test
y_predict = Logistic_Regression_smote.predict(X_test)
# Performance metrics for base model on test data
print(metrics.classification_report(y_test, y_predict))
C:\Users\anime\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1509: UndefinedMetricWarning:

Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

              precision    recall  f1-score   support

           I       0.79      0.89      0.83        62
          II       0.25      0.25      0.25         8
         III       0.25      0.17      0.20         6
          IV       0.50      0.17      0.25         6
           V       0.00      0.00      0.00         2

    accuracy                           0.70        84
   macro avg       0.36      0.29      0.31        84
weighted avg       0.66      0.70      0.67        84

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1509: UndefinedMetricWarning:

Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1509: UndefinedMetricWarning:

Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

  • Use cross validation techniques and apply hyper-parameter tuning techniques to get the best performance of the model.
In [ ]:
# grid search
# Choose the type of classifier.
Logistic_Regression_tuned = LogisticRegression(random_state=1)

# Grid of parameters to choose from
parameters = {'penalty': ['l1', 'l2', 'elasticnet', None],
              'C': [1.0, 10.0, 100.0],
              'solver' : ['lbfgs', 'liblinear', 'newton-cg', 'newton-cholesky', 'sag', 'saga'],
              'multi_class' : ['auto', 'ovr', 'multinomial']
             }

# Type of scoring used to compare parameter combinations
# acc_scorer = metrics.make_scorer(metrics.recall_score)


# Run the grid search
grid_obj = GridSearchCV(Logistic_Regression_tuned, parameters,cv=5) # , scoring=acc_scorer - # by default the scoring metrics would be the accuracy
# Used 5-fold cross validation
grid_obj = grid_obj.fit(X_train_smote, y_train_smote)

# Set the model to the best combination of parameters
Logistic_Regression_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
Logistic_Regression_tuned.fit(X_train_smote, y_train_smote)
C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_glm\_newton_solver.py:498: LinAlgWarning:

The inner solver of NewtonCholeskySolver stumbled upon a singular or very ill-conditioned Hessian matrix at iteration #1. It will now resort to lbfgs instead.
Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Matrix is singular.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:1186: UserWarning:

Setting penalty=None will ignore the C and l1_ratio parameters

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning:

The max_iter was reached which means the coef_ did not converge

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py:547: FitFailedWarning:


555 fits failed out of a total of 1080.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
45 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\anime\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\anime\anaconda3\Lib\site-packages\sklearn\base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py", line 1172, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py", line 67, in _check_solver
    raise ValueError(
ValueError: Solver lbfgs supports only 'l2' or None penalties, got l1 penalty.

--------------------------------------------------------------------------------
45 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\anime\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\anime\anaconda3\Lib\site-packages\sklearn\base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py", line 1172, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py", line 67, in _check_solver
    raise ValueError(
ValueError: Solver newton-cg supports only 'l2' or None penalties, got l1 penalty.

--------------------------------------------------------------------------------
45 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\anime\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\anime\anaconda3\Lib\site-packages\sklearn\base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py", line 1172, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py", line 67, in _check_solver
    raise ValueError(
ValueError: Solver newton-cholesky supports only 'l2' or None penalties, got l1 penalty.

--------------------------------------------------------------------------------
45 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\anime\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\anime\anaconda3\Lib\site-packages\sklearn\base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py", line 1172, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py", line 67, in _check_solver
    raise ValueError(
ValueError: Solver sag supports only 'l2' or None penalties, got l1 penalty.

--------------------------------------------------------------------------------
45 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\anime\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\anime\anaconda3\Lib\site-packages\sklearn\base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py", line 1172, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py", line 67, in _check_solver
    raise ValueError(
ValueError: Solver lbfgs supports only 'l2' or None penalties, got elasticnet penalty.

--------------------------------------------------------------------------------
45 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\anime\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\anime\anaconda3\Lib\site-packages\sklearn\base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py", line 1172, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py", line 75, in _check_solver
    raise ValueError(
ValueError: Only 'saga' solver supports elasticnet penalty, got solver=liblinear.

--------------------------------------------------------------------------------
45 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\anime\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\anime\anaconda3\Lib\site-packages\sklearn\base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py", line 1172, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py", line 67, in _check_solver
    raise ValueError(
ValueError: Solver newton-cg supports only 'l2' or None penalties, got elasticnet penalty.

--------------------------------------------------------------------------------
45 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\anime\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\anime\anaconda3\Lib\site-packages\sklearn\base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py", line 1172, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py", line 67, in _check_solver
    raise ValueError(
ValueError: Solver newton-cholesky supports only 'l2' or None penalties, got elasticnet penalty.

--------------------------------------------------------------------------------
45 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\anime\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\anime\anaconda3\Lib\site-packages\sklearn\base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py", line 1172, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py", line 67, in _check_solver
    raise ValueError(
ValueError: Solver sag supports only 'l2' or None penalties, got elasticnet penalty.

--------------------------------------------------------------------------------
45 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\anime\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\anime\anaconda3\Lib\site-packages\sklearn\base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py", line 1182, in fit
    raise ValueError("l1_ratio must be specified when penalty is elasticnet.")
ValueError: l1_ratio must be specified when penalty is elasticnet.

--------------------------------------------------------------------------------
45 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\anime\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\anime\anaconda3\Lib\site-packages\sklearn\base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py", line 1172, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py", line 80, in _check_solver
    raise ValueError("penalty=None is not supported for the liblinear solver")
ValueError: penalty=None is not supported for the liblinear solver

--------------------------------------------------------------------------------
30 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\anime\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\anime\anaconda3\Lib\site-packages\sklearn\base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py", line 1212, in fit
    multi_class = _check_multi_class(self.multi_class, solver, len(self.classes_))
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py", line 99, in _check_multi_class
    raise ValueError("Solver %s does not support a multinomial backend." % solver)
ValueError: Solver liblinear does not support a multinomial backend.

--------------------------------------------------------------------------------
30 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\anime\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\anime\anaconda3\Lib\site-packages\sklearn\base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py", line 1212, in fit
    multi_class = _check_multi_class(self.multi_class, solver, len(self.classes_))
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\anime\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py", line 99, in _check_multi_class
    raise ValueError("Solver %s does not support a multinomial backend." % solver)
ValueError: Solver newton-cholesky does not support a multinomial backend.


C:\Users\anime\anaconda3\Lib\site-packages\sklearn\model_selection\_search.py:1051: UserWarning:

One or more of the test scores are non-finite: [       nan 0.91902834        nan        nan        nan 0.90931174
 0.97732794 0.96761134 0.97732794 0.96923077 0.97732794 0.97732794
        nan        nan        nan        nan        nan        nan
 0.97651822        nan 0.95870445 0.97165992 0.97246964 0.97327935
        nan 0.91902834        nan        nan        nan 0.91902834
 0.96923077 0.96761134 0.96923077 0.96923077 0.96923077 0.96761134
        nan        nan        nan        nan        nan        nan
 0.97165992        nan 0.95303644 0.97165992 0.97004049 0.97165992
        nan        nan        nan        nan        nan 0.90931174
 0.97732794        nan 0.97732794        nan 0.97732794 0.97732794
        nan        nan        nan        nan        nan        nan
 0.97651822        nan 0.95870445        nan 0.97246964 0.97327935
        nan 0.96923077        nan        nan        nan 0.97246964
 0.9805668  0.97732794 0.9805668  0.97732794 0.9805668  0.97813765
        nan        nan        nan        nan        nan        nan
 0.97651822        nan 0.95870445 0.97165992 0.97246964 0.97327935
        nan 0.96923077        nan        nan        nan 0.97246964
 0.97651822 0.97732794 0.97732794 0.97732794 0.97732794 0.9757085
        nan        nan        nan        nan        nan        nan
 0.97165992        nan 0.95303644 0.97165992 0.97004049 0.97165992
        nan        nan        nan        nan        nan 0.97246964
 0.9805668         nan 0.9805668         nan 0.9805668  0.97813765
        nan        nan        nan        nan        nan        nan
 0.97651822        nan 0.95870445        nan 0.97246964 0.97327935
        nan 0.96194332        nan        nan        nan 0.97165992
 0.97894737 0.97732794 0.97813765 0.97732794 0.9757085  0.97408907
        nan        nan        nan        nan        nan        nan
 0.97651822        nan 0.95870445 0.97165992 0.97246964 0.97327935
        nan 0.96194332        nan        nan        nan 0.97246964
 0.9757085  0.97732794 0.97732794 0.97732794 0.97327935 0.97327935
        nan        nan        nan        nan        nan        nan
 0.97165992        nan 0.95303644 0.97165992 0.97004049 0.97165992
        nan        nan        nan        nan        nan 0.97165992
 0.97894737        nan 0.97813765        nan 0.9757085  0.97408907
        nan        nan        nan        nan        nan        nan
 0.97651822        nan 0.95870445        nan 0.97246964 0.97327935]

Out[ ]:
LogisticRegression(C=10.0, random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LogisticRegression(C=10.0, random_state=1)
In [ ]:
# checking best parameters
grid_obj.best_params_
Out[ ]:
{'C': 10.0, 'multi_class': 'auto', 'penalty': 'l2', 'solver': 'lbfgs'}
In [ ]:
# Checking train accuracy
model_score_train = Logistic_Regression_tuned.score(X_train_smote, y_train_smote)
print('Train Accuracy:', model_score_train)
# Checking test accuracy
model_score_test = Logistic_Regression_tuned.score(X_test, y_test)
print('Test Accuracy:', model_score_test)
Train Accuracy: 0.9975708502024292
Test Accuracy: 0.7023809523809523
  • It seems the tuned Logistic Regression model trained on balanced training data is overfitting because there is a significant difference between the train and test accuracy.
In [ ]:
#predict on test
y_predict = Logistic_Regression_tuned.predict(X_test)
# Performance metrics for base model on test data
print(metrics.classification_report(y_test, y_predict))
              precision    recall  f1-score   support

           I       0.78      0.90      0.84        62
          II       0.17      0.12      0.14         8
         III       0.25      0.17      0.20         6
          IV       0.50      0.17      0.25         6
           V       0.00      0.00      0.00         2

    accuracy                           0.70        84
   macro avg       0.34      0.27      0.29        84
weighted avg       0.64      0.70      0.66        84

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1509: UndefinedMetricWarning:

Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1509: UndefinedMetricWarning:

Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1509: UndefinedMetricWarning:

Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

In [ ]:
# defining list of models for model comparison
models = [Logistic_Regression_smote, Logistic_Regression_tuned]

# defining empty lists to add train and test results
acc_train = []
acc_test = []
recall_train = []
recall_test = []
precision_train = []
precision_test = []
f1_score_train = []
f1_score_test = []

# looping through all the models to get the accuracy, recall, precision and f-1 scores
for model in models:
    j = get_metrics_score(model,False)
    acc_train.append(np.round(j[0],2))
    acc_test.append(np.round(j[1],2))
    precision_train.append(np.round(j[2],2))
    precision_test.append(np.round(j[3],2))
    recall_train.append(np.round(j[4],2))
    recall_test.append(np.round(j[5],2))
    f1_score_train.append(np.round(j[6],2))
    f1_score_test.append(np.round(j[7],2))

comparison_frame = pd.DataFrame({'Model':['Logistic_Regression_smote', 'Logistic_Regression_tuned'],
                                          'Train_Accuracy': acc_train,'Test_Accuracy': acc_test,
                                          'Train_Recall':recall_train,'Test_Recall':recall_test,
                                          'Train_Precision':precision_train,'Test_Precision':precision_test,
                                          'Train_f1_score':f1_score_train,'Test_f1_score':f1_score_test
                                })
comparison_frame
Out[ ]:
Model Train_Accuracy Test_Accuracy Train_Recall Test_Recall Train_Precision Test_Precision Train_f1_score Test_f1_score
0 Logistic_Regression_smote 0.99 0.7 0.99 0.7 0.99 0.66 0.99 0.67
1 Logistic_Regression_tuned 1.00 0.7 1.00 0.7 1.00 0.64 1.00 0.66
  • There seems to be overfitting in the above models as there is a significant (more than 25%) difference in the train and test metrics.
3. SVM¶
In [ ]:
from sklearn import svm
In [ ]:
# SVM Base Model on original training data
SVM = svm.SVC()
SVM.fit(X_train, y_train)
# Checking train accuracy
model_score_train = SVM.score(X_train, y_train)
print('Train Accuracy:', model_score_train)
# Checking test accuracy
model_score_test = SVM.score(X_test, y_test)
print('Test Accuracy:', model_score_test)
Train Accuracy: 0.7724550898203593
Test Accuracy: 0.7380952380952381
  • It seems the base SVM model trained on imbalanced data is not overfitting because there is no significant difference between the train and test accuracy.
In [ ]:
#predict on test
y_predict = SVM.predict(X_test)
# Performance metrics for base model on test data
print(metrics.classification_report(y_test, y_predict))
              precision    recall  f1-score   support

           I       0.74      1.00      0.85        62
          II       0.00      0.00      0.00         8
         III       0.00      0.00      0.00         6
          IV       0.00      0.00      0.00         6
           V       0.00      0.00      0.00         2

    accuracy                           0.74        84
   macro avg       0.15      0.20      0.17        84
weighted avg       0.54      0.74      0.63        84

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1509: UndefinedMetricWarning:

Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1509: UndefinedMetricWarning:

Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1509: UndefinedMetricWarning:

Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

In [ ]:
# SVM Base Model on balanced training data
SVM_smote = svm.SVC()
SVM_smote.fit(X_train_smote, y_train_smote)
# Checking train accuracy
model_score_train = SVM_smote.score(X_train_smote, y_train_smote)
print('Train Accuracy:', model_score_train)
# Checking test accuracy
model_score_test = SVM_smote.score(X_test, y_test)
print('Test Accuracy:', model_score_test)
Train Accuracy: 0.9975708502024292
Test Accuracy: 0.7380952380952381
  • It seems the base SVM model is trained on balanced training data is overfitting because there is a significant difference between the train and test accuracy.
In [ ]:
#predict on test
y_predict = SVM_smote.predict(X_test)
# Performance metrics for base model on test data
print(metrics.classification_report(y_test, y_predict))
              precision    recall  f1-score   support

           I       0.75      1.00      0.86        62
          II       0.00      0.00      0.00         8
         III       0.00      0.00      0.00         6
          IV       0.00      0.00      0.00         6
           V       0.00      0.00      0.00         2

    accuracy                           0.74        84
   macro avg       0.15      0.20      0.17        84
weighted avg       0.55      0.74      0.63        84

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1509: UndefinedMetricWarning:

Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1509: UndefinedMetricWarning:

Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1509: UndefinedMetricWarning:

Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

  • Use cross validation techniques and apply hyper-parameter tuning techniques to get the best performance of the model.
In [ ]:
# grid search
# Choose the type of classifier.
SVM_tuned = svm.SVC(random_state=1)

# Grid of parameters to choose from
parameters = {'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
              'C': [1.0, 10.0, 100.0],
              'degree': [2,3,4],
              'gamma' : ['scale', 'auto'],
              'decision_function_shape' : ['ovo', 'ovr']
             }

# Type of scoring used to compare parameter combinations
# acc_scorer = metrics.make_scorer(metrics.recall_score)


# Run the grid search
grid_obj = GridSearchCV(SVM_tuned, parameters,cv=5) # , scoring=acc_scorer - # by default the scoring metrics would be the accuracy
# Used 5-fold cross validation
grid_obj = grid_obj.fit(X_train_smote, y_train_smote)

# Set the model to the best combination of parameters
SVM_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
SVM_tuned.fit(X_train_smote, y_train_smote)
Out[ ]:
SVC(decision_function_shape='ovo', degree=2, random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
SVC(decision_function_shape='ovo', degree=2, random_state=1)
In [ ]:
# checking best parameters
grid_obj.best_params_
Out[ ]:
{'C': 1.0,
 'decision_function_shape': 'ovo',
 'degree': 2,
 'gamma': 'scale',
 'kernel': 'rbf'}
In [ ]:
# Checking train accuracy
model_score_train = SVM_tuned.score(X_train_smote, y_train_smote)
print('Train Accuracy:', model_score_train)
# Checking test accuracy
model_score_test = SVM_tuned.score(X_test, y_test)
print('Test Accuracy:', model_score_test)
Train Accuracy: 0.9975708502024292
Test Accuracy: 0.7380952380952381
  • It seems the tuned SVM model trained on balanced training data is overfitting because there is a significant difference between the train and test accuracy.
In [ ]:
#predict on test
y_predict = SVM_tuned.predict(X_test)
# Performance metrics for base model on test data
print(metrics.classification_report(y_test, y_predict))
              precision    recall  f1-score   support

           I       0.75      1.00      0.86        62
          II       0.00      0.00      0.00         8
         III       0.00      0.00      0.00         6
          IV       0.00      0.00      0.00         6
           V       0.00      0.00      0.00         2

    accuracy                           0.74        84
   macro avg       0.15      0.20      0.17        84
weighted avg       0.55      0.74      0.63        84

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1509: UndefinedMetricWarning:

Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1509: UndefinedMetricWarning:

Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1509: UndefinedMetricWarning:

Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

In [ ]:
# defining list of models for model comparison
models = [SVM_smote, SVM_tuned]

# defining empty lists to add train and test results
acc_train = []
acc_test = []
recall_train = []
recall_test = []
precision_train = []
precision_test = []
f1_score_train = []
f1_score_test = []

# looping through all the models to get the accuracy, recall, precision and f-1 scores
for model in models:
    j = get_metrics_score(model,False)
    acc_train.append(np.round(j[0],2))
    acc_test.append(np.round(j[1],2))
    precision_train.append(np.round(j[2],2))
    precision_test.append(np.round(j[3],2))
    recall_train.append(np.round(j[4],2))
    recall_test.append(np.round(j[5],2))
    f1_score_train.append(np.round(j[6],2))
    f1_score_test.append(np.round(j[7],2))

# Creating dataframe for all the metrics for the models listed

comparison_frame = pd.DataFrame({'Model':['SVM_smote', 'SVM_tuned'],
                                          'Train_Accuracy': acc_train,'Test_Accuracy': acc_test,
                                          'Train_Recall':recall_train,'Test_Recall':recall_test,
                                          'Train_Precision':precision_train,'Test_Precision':precision_test,
                                          'Train_f1_score':f1_score_train,'Test_f1_score':f1_score_test
                                })
comparison_frame
Out[ ]:
Model Train_Accuracy Test_Accuracy Train_Recall Test_Recall Train_Precision Test_Precision Train_f1_score Test_f1_score
0 SVM_smote 1.0 0.74 1.0 0.74 1.0 0.55 1.0 0.63
1 SVM_tuned 1.0 0.74 1.0 0.74 1.0 0.55 1.0 0.63
  • There seems to be overfitting in the above models as there is a significant (more than 25%) difference in the train and test metrics.
4. Random Forest¶
In [ ]:
# Random Forest Base Model on original training data
Random_Forest = RandomForestClassifier(random_state=1)
Random_Forest.fit(X_train, y_train)
# Checking train accuracy
model_score_train = Random_Forest.score(X_train, y_train)
print('Train Accuracy:', model_score_train)
# Checking test accuracy
model_score_test = Random_Forest.score(X_test, y_test)
print('Test Accuracy:', model_score_test)
Train Accuracy: 0.9970059880239521
Test Accuracy: 0.75
  • It seems the base Random Forest model trained on imbalanced training data is overfitting because there is a significant difference between the train and test accuracy.
In [ ]:
#predict on test
y_predict = Random_Forest.predict(X_test)
# Performance metrics for base model on test data
print(metrics.classification_report(y_test, y_predict))
              precision    recall  f1-score   support

           I       0.76      1.00      0.86        62
          II       0.50      0.12      0.20         8
         III       0.00      0.00      0.00         6
          IV       0.00      0.00      0.00         6
           V       0.00      0.00      0.00         2

    accuracy                           0.75        84
   macro avg       0.25      0.23      0.21        84
weighted avg       0.61      0.75      0.65        84

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1509: UndefinedMetricWarning:

Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1509: UndefinedMetricWarning:

Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1509: UndefinedMetricWarning:

Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

In [ ]:
# Random forest Base Model on balanced training data
Random_Forest_smote = RandomForestClassifier(random_state=1)
Random_Forest_smote.fit(X_train_smote, y_train_smote)
# Checking train accuracy
model_score_train = Random_Forest_smote.score(X_train_smote, y_train_smote)
print('Train Accuracy:', model_score_train)
# Checking test accuracy
model_score_test = Random_Forest_smote.score(X_test, y_test)
print('Test Accuracy:', model_score_test)
Train Accuracy: 0.9991902834008097
Test Accuracy: 0.7380952380952381
  • It seems the base Random Forest model trained on balanced training data is overfitting because there is a significant difference between the train and test accuracy
In [ ]:
#predict on test
y_predict = Random_Forest_smote.predict(X_test)
# Performance metrics for base model on test data
print(metrics.classification_report(y_test, y_predict))
              precision    recall  f1-score   support

           I       0.76      1.00      0.86        62
          II       0.00      0.00      0.00         8
         III       0.00      0.00      0.00         6
          IV       0.00      0.00      0.00         6
           V       0.00      0.00      0.00         2

    accuracy                           0.74        84
   macro avg       0.15      0.20      0.17        84
weighted avg       0.56      0.74      0.64        84

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1509: UndefinedMetricWarning:

Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1509: UndefinedMetricWarning:

Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1509: UndefinedMetricWarning:

Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

  • Use cross validation techniques and apply hyper-parameter tuning techniques to get the best performance of the model.
In [ ]:
# grid search
# Choose the type of classifier.
Random_Forest_tuned = RandomForestClassifier(random_state=1)

# Grid of parameters to choose from
parameters = {"n_estimators": [100,150,200,250],
    "min_samples_leaf": [5, 10],
    "max_features": [0.2, 0.7],
    "max_samples": [0.3, 0.7]
             }

# Type of scoring used to compare parameter combinations
# acc_scorer = metrics.make_scorer(metrics.recall_score)


# Run the grid search
grid_obj = GridSearchCV(Random_Forest_tuned, parameters,cv=5) # , scoring=acc_scorer - # by default the scoring metrics would be the accuracy
# Used 5-fold cross validation
grid_obj = grid_obj.fit(X_train_smote, y_train_smote)

# Set the model to the best combination of parameters
Random_Forest_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
Random_Forest_tuned.fit(X_train_smote, y_train_smote)
Out[ ]:
RandomForestClassifier(max_features=0.2, max_samples=0.7, min_samples_leaf=5,
                       n_estimators=150, random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomForestClassifier(max_features=0.2, max_samples=0.7, min_samples_leaf=5,
                       n_estimators=150, random_state=1)
In [ ]:
# checking best parameters
grid_obj.best_params_
Out[ ]:
{'max_features': 0.2,
 'max_samples': 0.7,
 'min_samples_leaf': 5,
 'n_estimators': 150}
In [ ]:
# Checking train accuracy
model_score_train = Random_Forest_tuned.score(X_train_smote, y_train_smote)
print('Train Accuracy:', model_score_train)
# Checking test accuracy
model_score_test = Random_Forest_tuned.score(X_test, y_test)
print('Test Accuracy:', model_score_test)
Train Accuracy: 0.982995951417004
Test Accuracy: 0.7142857142857143
  • It seems the tuned Random Forest model trained on balanced training data is overfitting because there is a significant difference between the train and test accuracy
In [ ]:
#predict on test
y_predict = Random_Forest_tuned.predict(X_test)
# Performance metrics for base model on test data
print(metrics.classification_report(y_test, y_predict))
              precision    recall  f1-score   support

           I       0.79      0.94      0.86        62
          II       0.00      0.00      0.00         8
         III       0.67      0.33      0.44         6
          IV       0.00      0.00      0.00         6
           V       0.00      0.00      0.00         2

    accuracy                           0.71        84
   macro avg       0.29      0.25      0.26        84
weighted avg       0.63      0.71      0.67        84

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1509: UndefinedMetricWarning:

Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1509: UndefinedMetricWarning:

Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1509: UndefinedMetricWarning:

Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

  • Use cross validation techniques and apply hyper-parameter tuning techniques to get the best performance of the model.
In [ ]:
# grid search
# Choose the type of classifier.
Random_Forest_tuned = RandomForestClassifier(random_state=1)

# Grid of parameters to choose from
parameters = {"n_estimators": [100,150,200,250],
    "min_samples_leaf": [5, 10],
    "max_features": [0.2, 0.7],
    "max_samples": [0.3, 0.7]
             }

# Type of scoring used to compare parameter combinations
# acc_scorer = metrics.make_scorer(metrics.recall_score)


# Run the grid search
grid_obj = GridSearchCV(Random_Forest_tuned, parameters,cv=5) # , scoring=acc_scorer - # by default the scoring metrics would be the accuracy
# Used 5-fold cross validation
grid_obj = grid_obj.fit(X_train_smote, y_train_smote)

# Set the model to the best combination of parameters
Random_Forest_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
Random_Forest_tuned.fit(X_train_smote, y_train_smote)
Out[ ]:
RandomForestClassifier(max_features=0.2, max_samples=0.7, min_samples_leaf=5,
                       n_estimators=150, random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomForestClassifier(max_features=0.2, max_samples=0.7, min_samples_leaf=5,
                       n_estimators=150, random_state=1)
In [ ]:
# defining list of models for model comparison
models = [Random_Forest_smote, Random_Forest_tuned]

# defining empty lists to add train and test results
acc_train = []
acc_test = []
recall_train = []
recall_test = []
precision_train = []
precision_test = []
f1_score_train = []
f1_score_test = []

# looping through all the models to get the accuracy, recall, precision and f-1 scores
for model in models:
    j = get_metrics_score(model,False)
    acc_train.append(np.round(j[0],2))
    acc_test.append(np.round(j[1],2))
    precision_train.append(np.round(j[2],2))
    precision_test.append(np.round(j[3],2))
    recall_train.append(np.round(j[4],2))
    recall_test.append(np.round(j[5],2))
    f1_score_train.append(np.round(j[6],2))
    f1_score_test.append(np.round(j[7],2))

comparison_frame = pd.DataFrame({'Model':['Random_Forest_smote', 'Random_Forest_tuned'],
                                          'Train_Accuracy': acc_train,'Test_Accuracy': acc_test,
                                          'Train_Recall':recall_train,'Test_Recall':recall_test,
                                          'Train_Precision':precision_train,'Test_Precision':precision_test,
                                          'Train_f1_score':f1_score_train,'Test_f1_score':f1_score_test
                                })
comparison_frame
Out[ ]:
Model Train_Accuracy Test_Accuracy Train_Recall Test_Recall Train_Precision Test_Precision Train_f1_score Test_f1_score
0 Random_Forest_smote 1.00 0.74 1.00 0.74 1.00 0.56 1.00 0.64
1 Random_Forest_tuned 0.98 0.71 0.98 0.71 0.98 0.63 0.98 0.67
  • There seems to be overfitting in the above models as there is a significant (more than 25%) difference in the train and test metrics.
5. GradientBoost¶
In [ ]:
# Gradient Boost Base Model on original training data
GradientBoost = GradientBoostingClassifier(random_state=1)
GradientBoost.fit(X_train, y_train)
# Checking train accuracy
model_score_train = GradientBoost.score(X_train, y_train)
print('Train Accuracy:', model_score_train)
# Checking test accuracy
model_score_test = GradientBoost.score(X_test, y_test)
print('Test Accuracy:', model_score_test)
Train Accuracy: 0.9970059880239521
Test Accuracy: 0.6904761904761905
  • It seems the base GradientBoost model trained on imbalanced data is overfitting because there is a significant difference between the train and test accuracy
In [ ]:
#predict on test
y_predict = GradientBoost.predict(X_test)
# Performance metrics for base model on test data
print(metrics.classification_report(y_test, y_predict))
              precision    recall  f1-score   support

           I       0.75      0.92      0.83        62
          II       0.00      0.00      0.00         8
         III       0.00      0.00      0.00         6
          IV       0.25      0.17      0.20         6
           V       0.00      0.00      0.00         2

    accuracy                           0.69        84
   macro avg       0.20      0.22      0.21        84
weighted avg       0.57      0.69      0.62        84

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1509: UndefinedMetricWarning:

Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1509: UndefinedMetricWarning:

Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1509: UndefinedMetricWarning:

Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

In [ ]:
# Gradient Boost Base Model on balanced training data
GradientBoost_smote = GradientBoostingClassifier(random_state=1)
GradientBoost_smote.fit(X_train_smote, y_train_smote)
# Checking train accuracy
model_score_train = GradientBoost_smote.score(X_train_smote, y_train_smote)
print('Train Accuracy:', model_score_train)
# Checking test accuracy
model_score_test = GradientBoost_smote.score(X_test, y_test)
print('Test Accuracy:', model_score_test)
Train Accuracy: 0.9991902834008097
Test Accuracy: 0.7023809523809523
  • It seems the base Gradient Boost model trained on balanced training data is overfitting because there is a significant difference between the train and test accuracy
In [ ]:
#predict on test
y_predict = GradientBoost_smote.predict(X_test)
# Performance metrics for base model on test data
print(metrics.classification_report(y_test, y_predict))
              precision    recall  f1-score   support

           I       0.81      0.94      0.87        62
          II       0.00      0.00      0.00         8
         III       0.00      0.00      0.00         6
          IV       0.50      0.17      0.25         6
           V       0.00      0.00      0.00         2

    accuracy                           0.70        84
   macro avg       0.26      0.22      0.22        84
weighted avg       0.63      0.70      0.66        84

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1509: UndefinedMetricWarning:

Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1509: UndefinedMetricWarning:

Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1509: UndefinedMetricWarning:

Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

  • Use cross validation techniques and apply hyper-parameter tuning techniques to get the best performance of the model.
In [ ]:
# grid search
# Choose the type of classifier.
GradientBoost_tuned = GradientBoostingClassifier(random_state=1)

# Grid of parameters to choose from
parameters = {
    "loss": ['exponential', 'log_loss'],
    "learning_rate": [0.1, 1.0,10.0],
    "n_estimators": [50,100,200],
    "subsample": [0.1, 0.5, 1.0],
    "max_depth": [1,3,5],
             }

# Type of scoring used to compare parameter combinations
# acc_scorer = metrics.make_scorer(metrics.recall_score)


# Run the grid search
grid_obj = GridSearchCV(GradientBoost_tuned, parameters,cv=5) # , scoring=acc_scorer - # by default the scoring metrics would be the accuracy
# Used 5-fold cross validation
grid_obj = grid_obj.fit(X_train_smote, y_train_smote)

# Set the model to the best combination of parameters
GradientBoost_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
GradientBoost_tuned.fit(X_train_smote, y_train_smote)
C:\Users\anime\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py:547: FitFailedWarning:


405 fits failed out of a total of 810.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
405 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\anime\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\anime\anaconda3\Lib\site-packages\sklearn\base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\anime\anaconda3\Lib\site-packages\sklearn\ensemble\_gb.py", line 673, in fit
    self._loss = self._get_loss(sample_weight=sample_weight)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\anime\anaconda3\Lib\site-packages\sklearn\ensemble\_gb.py", line 1537, in _get_loss
    raise ValueError(
ValueError: loss='exponential' is only suitable for a binary classification problem, you have n_classes=5. Please use loss='log_loss' instead.


C:\Users\anime\anaconda3\Lib\site-packages\sklearn\model_selection\_search.py:1051: UserWarning:

One or more of the test scores are non-finite: [       nan        nan        nan        nan        nan        nan
        nan        nan        nan        nan        nan        nan
        nan        nan        nan        nan        nan        nan
        nan        nan        nan        nan        nan        nan
        nan        nan        nan 0.90040486 0.91740891 0.91983806
 0.93522267 0.93360324 0.93927126 0.9465587  0.94574899 0.94251012
 0.94493927 0.9465587  0.94251012 0.95546559 0.94817814 0.94898785
 0.96437247 0.95465587 0.95060729 0.9465587  0.9562753  0.93927126
 0.9659919  0.95708502 0.94979757 0.97408907 0.9611336  0.95384615
        nan        nan        nan        nan        nan        nan
        nan        nan        nan        nan        nan        nan
        nan        nan        nan        nan        nan        nan
        nan        nan        nan        nan        nan        nan
        nan        nan        nan 0.16032389 0.90040486 0.92712551
 0.16032389 0.90202429 0.94493927 0.16032389 0.76923077 0.94979757
 0.2097166  0.91740891 0.94493927 0.2097166  0.92712551 0.95789474
 0.2097166  0.93522267 0.95870445 0.14736842 0.94574899 0.951417
 0.1465587  0.951417   0.95708502 0.1465587  0.95708502 0.95789474
        nan        nan        nan        nan        nan        nan
        nan        nan        nan        nan        nan        nan
        nan        nan        nan        nan        nan        nan
        nan        nan        nan        nan        nan        nan
        nan        nan        nan 0.23643725 0.21214575 0.29230769
 0.23643725 0.21214575 0.29230769 0.23643725 0.21214575 0.29230769
 0.25587045 0.12226721 0.11497976 0.25587045 0.12226721 0.11497976
 0.25587045 0.12226721 0.11497976 0.18866397 0.09392713 0.09959514
 0.18866397 0.09392713 0.09959514 0.20890688 0.09392713 0.09959514]

Out[ ]:
GradientBoostingClassifier(max_depth=5, n_estimators=200, random_state=1,
                           subsample=0.1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GradientBoostingClassifier(max_depth=5, n_estimators=200, random_state=1,
                           subsample=0.1)
In [ ]:
# checking best parameters
grid_obj.best_params_
Out[ ]:
{'learning_rate': 0.1,
 'loss': 'log_loss',
 'max_depth': 5,
 'n_estimators': 200,
 'subsample': 0.1}
In [ ]:
# Checking train accuracy
model_score_train = GradientBoost_tuned.score(X_train_smote, y_train_smote)
print('Train Accuracy:', model_score_train)
# Checking test accuracy
model_score_test = GradientBoost_tuned.score(X_test, y_test)
print('Test Accuracy:', model_score_test)
Train Accuracy: 0.9983805668016195
Test Accuracy: 0.7261904761904762
  • It seems the tuned Gradient Boost model trained on balanced training data is overfitting because there is a significant difference between the train and test accuracy
In [ ]:
#predict on test
y_predict = GradientBoost_tuned.predict(X_test)
# Performance metrics for base model on test data
print(metrics.classification_report(y_test, y_predict))
              precision    recall  f1-score   support

           I       0.82      0.95      0.88        62
          II       0.11      0.12      0.12         8
         III       0.33      0.17      0.22         6
          IV       0.00      0.00      0.00         6
           V       0.00      0.00      0.00         2

    accuracy                           0.73        84
   macro avg       0.25      0.25      0.24        84
weighted avg       0.64      0.73      0.68        84

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1509: UndefinedMetricWarning:

Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1509: UndefinedMetricWarning:

Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

C:\Users\anime\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1509: UndefinedMetricWarning:

Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

In [ ]:
# defining list of models for model comparison
models = [GradientBoost_smote, GradientBoost_tuned]

# defining empty lists to add train and test results
acc_train = []
acc_test = []
recall_train = []
recall_test = []
precision_train = []
precision_test = []
f1_score_train = []
f1_score_test = []

# looping through all the models to get the accuracy, recall, precision and f-1 scores
for model in models:
    j = get_metrics_score(model,False)
    acc_train.append(np.round(j[0],2))
    acc_test.append(np.round(j[1],2))
    precision_train.append(np.round(j[2],2))
    precision_test.append(np.round(j[3],2))
    recall_train.append(np.round(j[4],2))
    recall_test.append(np.round(j[5],2))
    f1_score_train.append(np.round(j[6],2))
    f1_score_test.append(np.round(j[7],2))

comparison_frame = pd.DataFrame({'Model':['GradientBoost_smote', 'GradientBoost_tuned'],
                                          'Train_Accuracy': acc_train,'Test_Accuracy': acc_test,
                                          'Train_Recall':recall_train,'Test_Recall':recall_test,
                                          'Train_Precision':precision_train,'Test_Precision':precision_test,
                                          'Train_f1_score':f1_score_train,'Test_f1_score':f1_score_test
                                })
comparison_frame
Out[ ]:
Model Train_Accuracy Test_Accuracy Train_Recall Test_Recall Train_Precision Test_Precision Train_f1_score Test_f1_score
0 GradientBoost_smote 1.0 0.70 1.0 0.70 1.0 0.63 1.0 0.66
1 GradientBoost_tuned 1.0 0.73 1.0 0.73 1.0 0.64 1.0 0.68
  • There seems to be overfitting in the above models as there is a significant (more than 25%) difference in the train and test metrics.
Display and compare all the models using balanced data(SMOTE) and hyper-parameter tuned models with their train and test metrics.¶
In [ ]:
# defining list of models
models = [Naive_Bayes_smote, Naive_Bayes_tuned, Logistic_Regression_smote, Logistic_Regression_tuned, SVM_smote, SVM_tuned, Random_Forest_smote, Random_Forest_tuned,  GradientBoost_smote, GradientBoost_tuned]

# defining empty lists to add train and test results
acc_train = []
acc_test = []
recall_train = []
recall_test = []
precision_train = []
precision_test = []
f1_score_train = []
f1_score_test = []

# looping through all the models to get the accuracy, recall, precision and f-1 scores
for model in models:
    j = get_metrics_score(model,False)
    acc_train.append(np.round(j[0],2))
    acc_test.append(np.round(j[1],2))
    precision_train.append(np.round(j[2],2))
    precision_test.append(np.round(j[3],2))
    recall_train.append(np.round(j[4],2))
    recall_test.append(np.round(j[5],2))
    f1_score_train.append(np.round(j[6],2))
    f1_score_test.append(np.round(j[7],2))

comparison_frame = pd.DataFrame({'Model':['Naive_Bayes_smote', 'Naive_Bayes_tuned', 'Logistic_Regression_smote', 'Logistic_Regression_tuned', 'SVM_smote', 'SVM_tuned', 'Random_Forest_smote', 'Random_Forest_tuned', 'GradientBoost_smote', 'GradientBoost_tuned'],
                                          'Train_Accuracy': acc_train,'Test_Accuracy': acc_test,
                                          'Train_Recall':recall_train,'Test_Recall':recall_test,
                                          'Train_Precision':precision_train,'Test_Precision':precision_test,
                                          'Train_f1_score':f1_score_train,'Test_f1_score':f1_score_test})
comparison_frame
Out[ ]:
Model Train_Accuracy Test_Accuracy Train_Recall Test_Recall Train_Precision Test_Precision Train_f1_score Test_f1_score
0 Naive_Bayes_smote 0.96 0.63 0.96 0.63 0.96 0.72 0.96 0.66
1 Naive_Bayes_tuned 0.99 0.70 0.99 0.70 0.99 0.63 0.99 0.66
2 Logistic_Regression_smote 0.99 0.70 0.99 0.70 0.99 0.66 0.99 0.67
3 Logistic_Regression_tuned 1.00 0.70 1.00 0.70 1.00 0.64 1.00 0.66
4 SVM_smote 1.00 0.74 1.00 0.74 1.00 0.55 1.00 0.63
5 SVM_tuned 1.00 0.74 1.00 0.74 1.00 0.55 1.00 0.63
6 Random_Forest_smote 1.00 0.74 1.00 0.74 1.00 0.56 1.00 0.64
7 Random_Forest_tuned 0.98 0.71 0.98 0.71 0.98 0.63 0.98 0.67
8 GradientBoost_smote 1.00 0.70 1.00 0.70 1.00 0.63 1.00 0.66
9 GradientBoost_tuned 1.00 0.73 1.00 0.73 1.00 0.64 1.00 0.68
In [ ]:
from sklearn import metrics

# Function to calculate different metric scores(Accuracy, Recall, Precision, F1-score) of the base model trained on imbalanced data
def get_metrics_score_base_model(model, flag=True):
    '''
    model : classifier to predict values of X
    x_train_smote : Balanced training feature set
    x_test : Test feature set
    y_train_smote : Balanced training target labels
    y_test : Test target labels
    flag : If True, prints the classification report, accuracy, precision, recall, f1-score

    Returns:
    A list of train and test accuracy, precision, recall, F1-score (weighted average) metrics.
    '''
    # defining an empty list to store train and test results
    score_list = []

    # Predicting on train and test sets. We will use the balanced data
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)

    # Accuracy of the model
    train_acc = metrics.accuracy_score(y_train, pred_train)
    test_acc = metrics.accuracy_score(y_test, pred_test)

    # Classification report for weighted average precision, recall, f1-score
    classification_report_train = metrics.classification_report(y_train, pred_train, output_dict=True, zero_division=0)
    classification_report_test = metrics.classification_report(y_test, pred_test, output_dict=True, zero_division=0)

    # Weighted averages

    train_precision = classification_report_train['weighted avg']['precision']
    test_precision = classification_report_test['weighted avg']['precision']

    train_recall = classification_report_train['weighted avg']['recall']
    test_recall = classification_report_test['weighted avg']['recall']

    train_f1_score = classification_report_train['weighted avg']['f1-score']
    test_f1_score = classification_report_test['weighted avg']['f1-score']

    # Append all metrics (train/test accuracy, precision, recall, f1-score)
    score_list.extend([train_acc, test_acc, train_precision, test_precision, train_recall, test_recall, train_f1_score, test_f1_score])

    # If the flag is set to True, print out the classification reports and the metrics
    if flag:
        print("Classification Report (Train):\n", metrics.classification_report(y_train, pred_train, zero_division=0))
        print("Classification Report (Test):\n", metrics.classification_report(y_test, pred_test, zero_division=0))
        print("\nMetrics Summary:")
        print(f"Accuracy on Training set: {train_acc}")
        print(f"Accuracy on Test set: {test_acc}")
        print(f"Precision on Training set (Weighted Avg): {train_precision}")
        print(f"Precision on Test set (Weighted Avg): {test_precision}")
        print(f"Recall on Training set (Weighted Avg): {train_recall}")
        print(f"Recall on Test set (Weighted Avg): {test_recall}")
        print(f"F1-Score on Training set (Weighted Avg): {train_f1_score}")
        print(f"F1-Score on Test set (Weighted Avg): {test_f1_score}")

    return score_list  # returning the list with train and test scores
# ----------------------------------------------------------------------------
Display and compare all the base models trained on unbalanced data with their train and test metrics.¶
In [ ]:
# defining list of base models for model comparison
models = [Naive_Bayes, Logistic_Regression, SVM, Random_Forest, GradientBoost]

# defining empty lists to add train and test results
acc_train = []
acc_test = []
recall_train = []
recall_test = []
precision_train = []
precision_test = []
f1_score_train = []
f1_score_test = []

# looping through all the models to get the accuracy, recall, precision and f-1 scores
for model in models:
    j = get_metrics_score_base_model(model,False)
    acc_train.append(np.round(j[0],2))
    acc_test.append(np.round(j[1],2))
    precision_train.append(np.round(j[2],2))
    precision_test.append(np.round(j[3],2))
    recall_train.append(np.round(j[4],2))
    recall_test.append(np.round(j[5],2))
    f1_score_train.append(np.round(j[6],2))
    f1_score_test.append(np.round(j[7],2))


comparison_frame_base_model = pd.DataFrame({'Model':[ 'Naive_Bayes', 'Logistic_Regression', 'SVM', 'Random_Forest', 'GradientBoost'],
                                          'Train_Accuracy': acc_train,'Test_Accuracy': acc_test,
                                          'Train_Recall':recall_train,'Test_Recall':recall_test,
                                          'Train_Precision':precision_train,'Test_Precision':precision_test,
                                          'Train_f1_score':f1_score_train,'Test_f1_score':f1_score_test
                                })
comparison_frame_base_model
Out[ ]:
Model Train_Accuracy Test_Accuracy Train_Recall Test_Recall Train_Precision Test_Precision Train_f1_score Test_f1_score
0 Naive_Bayes 0.74 0.74 0.74 0.74 0.55 0.54 0.63 0.63
1 Logistic_Regression 0.74 0.74 0.74 0.74 0.55 0.54 0.63 0.63
2 SVM 0.77 0.74 0.77 0.74 0.81 0.54 0.70 0.63
3 Random_Forest 1.00 0.75 1.00 0.75 1.00 0.61 1.00 0.65
4 GradientBoost 1.00 0.69 1.00 0.69 1.00 0.57 1.00 0.62
Display and compare all the models with their train and test metrics.¶
In [ ]:
# Concatenating the comparison dataframes to make final comaprisons of metrics for all the models created
Final_Comparison = (pd.concat([comparison_frame_base_model, comparison_frame], ignore_index=True,axis=0))
Final_Comparison
Out[ ]:
Model Train_Accuracy Test_Accuracy Train_Recall Test_Recall Train_Precision Test_Precision Train_f1_score Test_f1_score
0 Naive_Bayes 0.74 0.74 0.74 0.74 0.55 0.54 0.63 0.63
1 Logistic_Regression 0.74 0.74 0.74 0.74 0.55 0.54 0.63 0.63
2 SVM 0.77 0.74 0.77 0.74 0.81 0.54 0.70 0.63
3 Random_Forest 1.00 0.75 1.00 0.75 1.00 0.61 1.00 0.65
4 GradientBoost 1.00 0.69 1.00 0.69 1.00 0.57 1.00 0.62
5 Naive_Bayes_smote 0.96 0.63 0.96 0.63 0.96 0.72 0.96 0.66
6 Naive_Bayes_tuned 0.99 0.70 0.99 0.70 0.99 0.63 0.99 0.66
7 Logistic_Regression_smote 0.99 0.70 0.99 0.70 0.99 0.66 0.99 0.67
8 Logistic_Regression_tuned 1.00 0.70 1.00 0.70 1.00 0.64 1.00 0.66
9 SVM_smote 1.00 0.74 1.00 0.74 1.00 0.55 1.00 0.63
10 SVM_tuned 1.00 0.74 1.00 0.74 1.00 0.55 1.00 0.63
11 Random_Forest_smote 1.00 0.74 1.00 0.74 1.00 0.56 1.00 0.64
12 Random_Forest_tuned 0.98 0.71 0.98 0.71 0.98 0.63 0.98 0.67
13 GradientBoost_smote 1.00 0.70 1.00 0.70 1.00 0.63 1.00 0.66
14 GradientBoost_tuned 1.00 0.73 1.00 0.73 1.00 0.64 1.00 0.68
Select the final best trained model¶
In [ ]:
# Ordering the models based on the descending order of Test_Accuracy
Final_Comparison.sort_values("Test_Accuracy", ascending=False) # inplace=True
Out[ ]:
Model Train_Accuracy Test_Accuracy Train_Recall Test_Recall Train_Precision Test_Precision Train_f1_score Test_f1_score
3 Random_Forest 1.00 0.75 1.00 0.75 1.00 0.61 1.00 0.65
0 Naive_Bayes 0.74 0.74 0.74 0.74 0.55 0.54 0.63 0.63
1 Logistic_Regression 0.74 0.74 0.74 0.74 0.55 0.54 0.63 0.63
2 SVM 0.77 0.74 0.77 0.74 0.81 0.54 0.70 0.63
9 SVM_smote 1.00 0.74 1.00 0.74 1.00 0.55 1.00 0.63
10 SVM_tuned 1.00 0.74 1.00 0.74 1.00 0.55 1.00 0.63
11 Random_Forest_smote 1.00 0.74 1.00 0.74 1.00 0.56 1.00 0.64
14 GradientBoost_tuned 1.00 0.73 1.00 0.73 1.00 0.64 1.00 0.68
12 Random_Forest_tuned 0.98 0.71 0.98 0.71 0.98 0.63 0.98 0.67
6 Naive_Bayes_tuned 0.99 0.70 0.99 0.70 0.99 0.63 0.99 0.66
7 Logistic_Regression_smote 0.99 0.70 0.99 0.70 0.99 0.66 0.99 0.67
8 Logistic_Regression_tuned 1.00 0.70 1.00 0.70 1.00 0.64 1.00 0.66
13 GradientBoost_smote 1.00 0.70 1.00 0.70 1.00 0.63 1.00 0.66
4 GradientBoost 1.00 0.69 1.00 0.69 1.00 0.57 1.00 0.62
5 Naive_Bayes_smote 0.96 0.63 0.96 0.63 0.96 0.72 0.96 0.66
In [ ]:
# Ordering the models based on the descending order of Test_f1_score
Final_Comparison.sort_values("Test_f1_score", ascending=False) # inplace=True
Out[ ]:
Model Train_Accuracy Test_Accuracy Train_Recall Test_Recall Train_Precision Test_Precision Train_f1_score Test_f1_score
14 GradientBoost_tuned 1.00 0.73 1.00 0.73 1.00 0.64 1.00 0.68
7 Logistic_Regression_smote 0.99 0.70 0.99 0.70 0.99 0.66 0.99 0.67
12 Random_Forest_tuned 0.98 0.71 0.98 0.71 0.98 0.63 0.98 0.67
5 Naive_Bayes_smote 0.96 0.63 0.96 0.63 0.96 0.72 0.96 0.66
6 Naive_Bayes_tuned 0.99 0.70 0.99 0.70 0.99 0.63 0.99 0.66
8 Logistic_Regression_tuned 1.00 0.70 1.00 0.70 1.00 0.64 1.00 0.66
13 GradientBoost_smote 1.00 0.70 1.00 0.70 1.00 0.63 1.00 0.66
3 Random_Forest 1.00 0.75 1.00 0.75 1.00 0.61 1.00 0.65
11 Random_Forest_smote 1.00 0.74 1.00 0.74 1.00 0.56 1.00 0.64
0 Naive_Bayes 0.74 0.74 0.74 0.74 0.55 0.54 0.63 0.63
1 Logistic_Regression 0.74 0.74 0.74 0.74 0.55 0.54 0.63 0.63
2 SVM 0.77 0.74 0.77 0.74 0.81 0.54 0.70 0.63
9 SVM_smote 1.00 0.74 1.00 0.74 1.00 0.55 1.00 0.63
10 SVM_tuned 1.00 0.74 1.00 0.74 1.00 0.55 1.00 0.63
4 GradientBoost 1.00 0.69 1.00 0.69 1.00 0.57 1.00 0.62

Insights and Conclusion:¶

  1. Based on Accuracy Random Forest base model trained on unbalanced data have the best accuracy followed by Naive Bayes, Logistic Regression and SVM base model. However, Random Forest base model trained on unbalanced data has significant amount of overfitting as there is a difference of more than 25% in the test and train metrics.

  2. Based on F-1 Score Tuned Gradient Boost model on trained on balanced training data have the best f-1 score. There are various other hyper-parameter tuned model which is giving a good f-1 score but all of these models are overfitting because there is a difference of more than or around 25% in the test and train metrics.

  3. Naive Bayes, Logistic Regression and SVM base model trained on unbalanced data have the same data for test metrics (Accuracy, precision, recall, f1-score) and these models are also not having over-fitting.

  4. We choose Naive Bayes as the best model because of the below reasons:

    1. Simplicity and Scalability:
      1. Naive Bayes is a simple yet highly effective algorithm for text classification tasks. It is computationally efficient and scales well to large datasets, making it ideal for dealing with a large number of accident descriptions.
      2. TF-IDF features represent text data in high-dimensional, sparse matrices. Naive Bayes (specifically Multinomial Naive Bayes) is inherently designed to handle discrete feature distributions and sparse matrices efficiently, such as term frequencies or TF-IDF values, making it a natural fit for text classification problems because it relies on simple statistical operations like counting word occurrences.
      3. It converges quickly and requires fewer iterations to fit, unlike SVMs, which can be computationally heavy due to their complex optimization procedures.
      4. Naive Bayes, due to its relatively simple modeling assumptions, is less likely to overfit the data compared to more flexible models like SVM or even Logistic Regression.
    2. Assumption of Independence:
      1. Naive Bayes assumes that all features (in this case, words or terms) are conditionally independent given the class label (i.e., accident severity). While this assumption is not entirely true in real-world text, it often works surprisingly well in practice for NLP tasks.
      2. In accident descriptions, each word contributes to the likelihood of a particular severity level. Since Naive Bayes works by multiplying the probabilities of individual words occurring in each class, it can capture strong patterns of word occurrences that correspond to different accident severity levels.
    3. Handling High Dimensionality:
      1. Text data, especially when transformed into TF-IDF features, results in high-dimensional feature spaces, where each term in the vocabulary is a feature. Naive Bayes can easily handle this high-dimensional space, as it does not need to perform complex optimization or tree-building procedures like other algorithms (e.g., GradientBoosting).
      2. Naive Bayes’ simplicity allows it to efficiently calculate probabilities for each class, even in the presence of thousands or tens of thousands of features, making it highly suitable for this use case.
    4. Handling Sparse Data:
      1. In TF-IDF matrices, most of the values are zeroes because not all words appear in every document. Naive Bayes is well-suited for sparse data, as it only calculates probabilities for the non-zero features (words present in the description). This makes it computationally efficient and fast for text data.
    5. Multi-Class Classification:
      1. Naive Bayes can handle multi-class classification problems effectively, which is critical in this case where the dependent variable (accident severity) has multiple levels (e.g., minor, moderate, severe, etc.).
      2. It directly estimates the probability of each class given the TF-IDF features, making it a natural fit for classifying accident severity based on text descriptions.
    6. Interpretability:
      1. Naive Bayes provides interpretable models by assigning probabilities to each class. This is useful when dealing with safety-related applications like accident severity classification, where stakeholders may want to understand the factors influencing the prediction.
      2. With Naive Bayes, we can easily understand which words contribute the most to the classification of each severity level by looking at the conditional probabilities of each word given a class.
  5. SVM and Logistic Regression might not be the best choice because of the below reasons:

    1. SVM:

      1. SVMs work well for classification tasks, especially when the data is not linearly separable. However, they can struggle with high-dimensional and sparse text data, such as TF-IDF representations.
      2. SVMs require careful tuning of hyperparameters like the regularization parameter (C) and kernel parameters. Without careful tuning, the model can easily overfit or underfit the data.
      3. SVMs are more computationally expensive, which can be a drawback for large text datasets.
    2. Logistic Regression:

      1. Logistic Regression can handle text data well but lacks the advantages Naive Bayes offers in terms of efficiency and robustness in the face of sparse text data. It often needs regularization techniques to prevent overfitting, and even then, it may not handle rare words or features effectively.
      2. If all three models give similar performance, choosing the simpler and more efficient Naive Bayes model would be ideal.

Improvements:¶

Improvements Attempted¶

  • Data Manipulation: We performed NLP pre-processing techniques like stopwords removal, special characters removal, lemmatization, lowercasing, punctuation removal, tokenization, and numbers removal data manipulation before starting the model-building process. We reduced noise by removing uninformative words (e.g., stopwords), irrelevant characters, and numbers. We normalized the text using lowercasing and lemmatization. We used tokenization to structure text into analyzable chunks.
  • Feature Selection: We have chosen the most significant terms based on term frequency or TF-IDF score for the model after using max_features =1000 with tfidf vectorizer. It helped us to reduce dimensionality and eliminate less informative words. Including all words could have led to overfitting, especially when rare or domain-specific words are present but contribute little to prediction. Limiting features helped us to improve model generalizability. Limiting to 1000 words kept the computations manageable while focusing on the most relevant words.
  • Balancing Imbalanced Data: We attempted to improve the performance of the model by balancing the data using SMOTE technique. SMOTE (Synthetic Minority Over-sampling Technique) generated synthetic samples for the minority class by interpolating between existing samples, thus balanced the class distribution. While this helped with class imbalance, it led to overfitting in all the models because the new synthetic samples were not real observations but artificially created based on the nearest neighbors of the minority class.
  • These synthetic samples might not capture the true underlying distribution of the data and can create an artificial structure that the model starts to memorize. This leads to overfitting, especially if the synthetic samples are close to each other, causing the model to learn the specific details of these synthetic samples rather than the general pattern of the minority class.
  • When using TF-IDF as the text representation, we ended up with high-dimensional, sparse data where each term in the vocabulary is a feature. High-dimensional data made it easier for models to overfit, as they can find spurious correlations between features and labels. After balancing the data with SMOTE, the model could have latched onto these noisy, artificial patterns in the high-dimensional feature space, resulting in overfitting.
  • Hyper-Parameter tuning We also attempted to improve the model performance using hyper-parameter tuning. It was found that the hyper-parameter tuned model trained on balanced training data was giving the same performance or degraded performance (mainly due to overfitting) compared to the base model trained on the imbalanced data. Therefore, we finalized the Naive Bayes Base model which was trained on the imbalanced training data.

Scope of Improvement¶

There is scope for further improvement by using Deep Learning Models (e.g., LSTM, BERT, RNN): Deep learning models, particularly LSTM (Long Short-Term Memory networks) and transformer-based models like BERT, can capture complex patterns and dependencies in text data that traditional models may not. They can outperform traditional models when sufficient training data is available and preprocessing is handled well.

Milestone 2:¶

  • Input: Preprocessed output from Milestone-1

Process:

  • Step 1: Design, train and test Neural networks classifiers
  • Step 2: Design, train and test RNN or LSTM classifiers
  • Step 3: Choose the best performing classifier and pickle it
  • Step 4: Final Report

Submission: Final report, Jupyter Notebook with all the steps in Milestone-1 and Milestone-2

Note: Guidance for Milestone-02 has been revised to use transformers like BERT to achieve improved performance of classification. The best performing classifier is also not pickled as suggested by the mentor.

In [ ]:
# Mounting Google Drive
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive

Import Basic Libraries¶

In [ ]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Note:

  • We wanted to test if the transfomers like BERT model will perform better with raw data or cleaned and pre-processed data obtained from 1st Milestone. Therefore, we imported both raw data and cleaned data to test the performance of model with both kinds of data.
In [ ]:
# Import the original excel data as dataframe using pandas
# data_original = pd.read_excel("Data+Set+-+industrial_safety_and_health_database_with_accidents_description.xlsx")
data_original = pd.read_excel('/content/drive/MyDrive/Great Learning/Capstone/Data+Set+-+industrial_safety_and_health_database_with_accidents_description.xlsx')
data_original.head()
Out[ ]:
Unnamed: 0 Data Countries Local Industry Sector Accident Level Potential Accident Level Genre Employee or Third Party Critical Risk Description
0 0 2016-01-01 Country_01 Local_01 Mining I IV Male Third Party Pressed While removing the drill rod of the Jumbo 08 f...
1 1 2016-01-02 Country_02 Local_02 Mining I IV Male Employee Pressurized Systems During the activation of a sodium sulphide pum...
2 2 2016-01-06 Country_01 Local_03 Mining I III Male Third Party (Remote) Manual Tools In the sub-station MILPO located at level +170...
3 3 2016-01-08 Country_01 Local_04 Mining I I Male Third Party Others Being 9:45 am. approximately in the Nv. 1880 C...
4 4 2016-01-10 Country_01 Local_04 Mining IV IV Male Third Party Others Approximately at 11:45 a.m. in circumstances t...
In [ ]:
# Reading the cleaned and NLP pre-processed csv data created in 1st milestone steps as Dataframe
df4 = pd.read_csv('/content/drive/MyDrive/Great Learning/Capstone/cleaned_industrial_safety_data.csv')
df4.head()
Out[ ]:
Date Country City Industry Sector Accident Level Potential Accident Level Gender Employee type Critical Risk Description Year Month Day Weekday Week of the Year Season count Preprocessed_Description
0 2016-01-01 Country_01 Local_01 Mining I IV Male Third Party Pressed While removing the drill rod of the Jumbo 08 f... 2016 1 1 4 53 Summer 1 removing drill rod jumbo maintenance superviso...
1 2016-01-02 Country_02 Local_02 Mining I IV Male Employee Pressurized Systems During the activation of a sodium sulphide pum... 2016 1 2 5 53 Summer 1 activation sodium sulphide pump piping uncoupl...
2 2016-01-06 Country_01 Local_03 Mining I III Male Third Party (Remote) Manual Tools In the sub-station MILPO located at level +170... 2016 1 6 2 1 Summer 1 substation milpo located level collaborator ex...
3 2016-01-08 Country_01 Local_04 Mining I I Male Third Party Others Being 9:45 am. approximately in the Nv. 1880 C... 2016 1 8 4 1 Summer 1 approximately nv cx695 ob7 personnel begin tas...
4 2016-01-10 Country_01 Local_04 Mining IV IV Male Third Party Others Approximately at 11:45 a.m. in circumstances t... 2016 1 10 6 1 Summer 1 approximately circumstance mechanic anthony gr...

First Approach: BERT Embeddings + Traditional Classifiers¶

In this approach, BERT embeddings are generated for the input text (like accident descriptions) and used as features in a traditional classifier, such as logistic regression, SVM, random forest, or gradient boosting.

Steps:

  • Use a pre-trained BERT model to extract sentence embeddings for each accident description.
  • Feed these embeddings as features to a classifier for multi-class accident level prediction.

Advantages:

  • *Interpretability*: Traditional classifiers, especially linear ones like logistic regression, can offer more interpretable results, as we can analyze feature importance and weights.
  • *Lower Computational Requirements*: Extracting embeddings and then training a classifier is generally faster and less resource-intensive than fine-tuning a transformer model on the entire dataset.
  • *Efficiency on Small Datasets*: This approach is often sufficient for smaller datasets, where training a full transformer model might lead to overfitting or be computationally prohibitive.
  • *Versatility*: We can experiment with different classifiers quickly (like logistic regression, SVM, etc.) to find which works best with the embeddings.

Limitations:

  • *Loss of Contextual Information*: The embeddings capture a general representation of the text but may lose some finer details compared to sequence classification, where the model is optimized end-to-end for the classification task.
  • *Suboptimal Performance on Complex Tasks*: For complex or nuanced text tasks, especially with substantial data, using pre-trained embeddings alone might be insufficient. Directly fine-tuning a transformer model can better capture specific context for classification.
  • *Lack of Task-Specific Adaptation*: The embeddings are generated based on the original training objectives of BERT (masked language modeling and next sentence prediction), which might not perfectly align with accident classification tasks.

This method is best to use when:

  • We have small to moderately sized datasets.
  • When interpretability or lower computational cost is critical.
  • When fine-tuning a full transformer model is infeasible due to resources.
In [ ]:
# Importing libraries to use transformers models
from transformers import BertTokenizer, BertModel, RobertaTokenizer, RobertaModel
import torch
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

Defining the tokenizer and get embeddings function¶

In [ ]:
# Choose a model: 'bert-base-uncased' or 'roberta-base'
model_name = 'bert-base-uncased'  # or 'roberta-base' for RoBERTa

# Initialize tokenizer and model
if 'bert' in model_name:
    tokenizer = BertTokenizer.from_pretrained(model_name)
    model = BertModel.from_pretrained(model_name)
elif 'roberta' in model_name:
    tokenizer = RobertaTokenizer.from_pretrained(model_name)
    model = RobertaModel.from_pretrained(model_name)

def get_embeddings(text_list, tokenizer, model):
    """ Get embeddings for each text in text_list using tokenizer and model """
    with torch.no_grad():
        inputs = tokenizer(text_list, return_tensors="pt", padding=True, truncation=True, max_length=512)
        outputs = model(**inputs)
        embeddings = outputs.last_hidden_state[:, 0, :]  # Get the CLS token embedding
        return embeddings.numpy()
/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:89: UserWarning: 
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
  warnings.warn(
tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]
vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]
tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]
config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]
/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
  warnings.warn(
model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Get Embeddings on cleaned data and Label Encoding¶

In [ ]:
# Get embeddings for descriptions
embeddings = get_embeddings(df4['Preprocessed_Description'].tolist(), tokenizer, model)

# Encode labels to numeric format
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(df4['Accident Level'])
In [ ]:
# viewing the embeddings
embeddings
Out[ ]:
array([[-0.8366512 ,  0.05823708,  0.21985547, ..., -0.05338174,
        -0.1721213 , -0.01900132],
       [-0.77411646,  0.0161009 ,  0.28162766, ..., -0.28561983,
        -0.13525663,  0.719628  ],
       [-0.42129612,  0.14128   ,  0.16086406, ..., -0.21355191,
        -0.2534065 ,  0.27237302],
       ...,
       [-0.32311797,  0.1012077 , -0.01155421, ..., -0.43120098,
         0.01447114,  0.406252  ],
       [-0.40220645,  0.16602494, -0.06106976, ..., -0.54142606,
        -0.02233603,  0.11503189],
       [-0.42237413,  0.10597786,  0.29480186, ..., -0.32330784,
        -0.22980821,  0.339079  ]], dtype=float32)
In [ ]:
# viewing the encoded labels
y
Out[ ]:
array([0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 2, 0, 0, 0, 0, 0,
       1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 2, 4, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 2, 0, 2, 0, 0, 2, 1, 0, 0, 0, 0, 1, 3, 0, 0, 0, 0,
       0, 3, 0, 1, 0, 0, 0, 0, 2, 0, 0, 1, 3, 0, 0, 2, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 2, 0, 0, 3, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 4, 1, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3,
       3, 0, 0, 0, 0, 0, 0, 4, 2, 0, 0, 3, 0, 2, 2, 3, 2, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 3, 0, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 0, 0, 2, 3, 0, 0, 0, 0, 2, 0, 0, 0, 1, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 2, 1, 0, 2, 0, 1, 2, 3, 0, 0, 1,
       0, 2, 0, 0, 3, 3, 0, 0, 3, 0, 2, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 0, 3, 2, 0, 0, 0,
       0, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 2, 3, 0, 0, 0, 0, 0, 4, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 3, 0, 1, 0, 2, 3, 0, 0, 0, 0,
       0, 3, 0, 3, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 4, 0, 1, 0, 1, 4,
       4, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3,
       3, 0, 0, 1, 0, 0, 2, 3, 1, 4, 0, 3, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 2, 0, 1, 3, 1, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0])

Splitting data into train and test¶

In [ ]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(embeddings, y, test_size=0.2, random_state=42)

Different classifiers on cleaned data's BERT embeddings¶

In [ ]:
# Importing Required Libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
In [ ]:
# Train various classification models
classifier = {
    # "Naive_Bayes": MultinomialNB(),
    "Logistic_Regression": LogisticRegression(random_state=1),
    "SVM": SVC(random_state=1),
    "Random_Forest": RandomForestClassifier(random_state=1),
    "Gradient_Boosting": GradientBoostingClassifier(random_state=1)
}

# Initialize an empty list to store classification metrics
metrics_list = []

for model_name, clf in classifier.items():
    # Train the model
    clf.fit(X_train, y_train)

    # Make predictions on the test data
    y_pred = clf.predict(X_test)

    # Get the classification report as a string
    report_str = classification_report(y_test, y_pred, zero_division=0)

    # Get the classification report as a dictionary
    report_dict = classification_report(y_test, y_pred, output_dict=True, zero_division=0)

    # Extract accuracy, precision, recall, and F1-score (average metrics for each model)
    accuracy = accuracy_score(y_test, y_pred)
    precision = report_dict['weighted avg']['precision']
    recall = report_dict['weighted avg']['recall']
    f1_score = report_dict['weighted avg']['f1-score']

    # Append the metrics to the list
    metrics_list.append({
        'Model': model_name,
        'Accuracy': accuracy,
        'Precision': precision,
        'Recall': recall,
        'F1-Score': f1_score
    })

    # Print the classification report for the current model
    print(f"Classification Report for {model_name}:")
    print(report_str)
    print("=" * 60)  # Divider for clarity

# Convert the list of metrics into a DataFrame for comparison
metrics_comparison = pd.DataFrame(metrics_list)

# Display the comparison of metrics for each model
print("Comparison of Metrics:")
print(metrics_comparison)
/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
Classification Report for Logistic_Regression:
              precision    recall  f1-score   support

           0       0.74      0.92      0.82        63
           1       0.00      0.00      0.00         8
           2       0.00      0.00      0.00         4
           3       0.00      0.00      0.00         7
           4       0.00      0.00      0.00         2

    accuracy                           0.69        84
   macro avg       0.15      0.18      0.16        84
weighted avg       0.56      0.69      0.62        84

============================================================
Classification Report for SVM:
              precision    recall  f1-score   support

           0       0.75      1.00      0.86        63
           1       0.00      0.00      0.00         8
           2       0.00      0.00      0.00         4
           3       0.00      0.00      0.00         7
           4       0.00      0.00      0.00         2

    accuracy                           0.75        84
   macro avg       0.15      0.20      0.17        84
weighted avg       0.56      0.75      0.64        84

============================================================
Classification Report for Random_Forest:
              precision    recall  f1-score   support

           0       0.75      1.00      0.86        63
           1       0.00      0.00      0.00         8
           2       0.00      0.00      0.00         4
           3       0.00      0.00      0.00         7
           4       0.00      0.00      0.00         2

    accuracy                           0.75        84
   macro avg       0.15      0.20      0.17        84
weighted avg       0.56      0.75      0.64        84

============================================================
Classification Report for Gradient_Boosting:
              precision    recall  f1-score   support

           0       0.76      0.90      0.83        63
           1       0.29      0.25      0.27         8
           2       0.00      0.00      0.00         4
           3       0.00      0.00      0.00         7
           4       0.00      0.00      0.00         2

    accuracy                           0.70        84
   macro avg       0.21      0.23      0.22        84
weighted avg       0.60      0.70      0.64        84

============================================================
Comparison of Metrics:
                 Model  Accuracy  Precision    Recall  F1-Score
0  Logistic_Regression  0.690476   0.557692  0.690476  0.617021
1                  SVM  0.750000   0.562500  0.750000  0.642857
2        Random_Forest  0.750000   0.562500  0.750000  0.642857
3    Gradient_Boosting  0.702381   0.597211  0.702381  0.644962
Insights:¶
  • It is observed that test accuracy of all the models falls in the range of 69-75% and test F-1 score falls in the range of 61-64% range.

Get Embeddings, Label Encoding on raw data and Create Train and Test Data¶

In [ ]:
# Get embeddings for raw descriptions
embeddings = get_embeddings(data_original['Description'].tolist(), tokenizer, model)

# Encode labels to numeric format
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(data_original['Accident Level'])

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(embeddings, y, test_size=0.2, random_state=42)

Different classifiers on raw data's BERT embeddings¶

In [ ]:
# Train various classification models on raw data
classifier = {
    # "Naive_Bayes": MultinomialNB(),
    "Logistic_Regression": LogisticRegression(random_state=1),
    "SVM": SVC(random_state=1),
    "Random_Forest": RandomForestClassifier(random_state=1),
    "Gradient_Boosting": GradientBoostingClassifier(random_state=1)
}

# Initialize an empty list to store classification metrics
metrics_list = []

for model_name, clf in classifier.items():
    # Train the model
    clf.fit(X_train, y_train)

    # Make predictions on the test data
    y_pred = clf.predict(X_test)

    # Get the classification report as a string
    report_str = classification_report(y_test, y_pred, zero_division=0)

    # Get the classification report as a dictionary
    report_dict = classification_report(y_test, y_pred, output_dict=True, zero_division=0)

    # Extract accuracy, precision, recall, and F1-score (average metrics for each model)
    accuracy = accuracy_score(y_test, y_pred)
    precision = report_dict['weighted avg']['precision']
    recall = report_dict['weighted avg']['recall']
    f1_score = report_dict['weighted avg']['f1-score']

    # Append the metrics to the list
    metrics_list.append({
        'Model': model_name,
        'Accuracy': accuracy,
        'Precision': precision,
        'Recall': recall,
        'F1-Score': f1_score
    })

    # Print the classification report for the current model
    print(f"Classification Report for {model_name}:")
    print(report_str)
    print("=" * 60)  # Divider for clarity

# Convert the list of metrics into a DataFrame for comparison
metrics_comparison = pd.DataFrame(metrics_list)

# Display the comparison of metrics for each model
print("Comparison of Metrics:")
print(metrics_comparison)
/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
Classification Report for Logistic_Regression:
              precision    recall  f1-score   support

           0       0.83      0.94      0.88        68
           1       0.20      0.17      0.18         6
           2       0.00      0.00      0.00         5
           3       0.00      0.00      0.00         4
           4       0.00      0.00      0.00         2

    accuracy                           0.76        85
   macro avg       0.21      0.22      0.21        85
weighted avg       0.68      0.76      0.72        85

============================================================
Classification Report for SVM:
              precision    recall  f1-score   support

           0       0.80      1.00      0.89        68
           1       0.00      0.00      0.00         6
           2       0.00      0.00      0.00         5
           3       0.00      0.00      0.00         4
           4       0.00      0.00      0.00         2

    accuracy                           0.80        85
   macro avg       0.16      0.20      0.18        85
weighted avg       0.64      0.80      0.71        85

============================================================
Classification Report for Random_Forest:
              precision    recall  f1-score   support

           0       0.80      1.00      0.89        68
           1       0.00      0.00      0.00         6
           2       0.00      0.00      0.00         5
           3       0.00      0.00      0.00         4
           4       0.00      0.00      0.00         2

    accuracy                           0.80        85
   macro avg       0.16      0.20      0.18        85
weighted avg       0.64      0.80      0.71        85

============================================================
Classification Report for Gradient_Boosting:
              precision    recall  f1-score   support

           0       0.79      0.93      0.85        68
           1       0.00      0.00      0.00         6
           2       0.00      0.00      0.00         5
           3       0.00      0.00      0.00         4
           4       0.00      0.00      0.00         2

    accuracy                           0.74        85
   macro avg       0.16      0.19      0.17        85
weighted avg       0.63      0.74      0.68        85

============================================================
Comparison of Metrics:
                 Model  Accuracy  Precision    Recall  F1-Score
0  Logistic_Regression  0.764706   0.679053  0.764706  0.719041
1                  SVM  0.800000   0.640000  0.800000  0.711111
2        Random_Forest  0.800000   0.640000  0.800000  0.711111
3    Gradient_Boosting  0.741176   0.630000  0.741176  0.681081
Insights:¶
  • It is observed that test accuracy of all the models trained on raw data's BERT embeddings falls in the range of 74-80% and test F-1 score falls in the range of 68-72% range, Whereas test accuracy of all the models trained on cleaned data's BERT embeddings falls in the range of 69-75% and test F-1 score falls in the range of 61-64% range.

Limitations and why we need fine tuning:¶

1. Loss of Contextual Information¶

  • BERT Embeddings Limitation:

    • Pre-trained BERT embeddings are often extracted as static embeddings (e.g., averaging token embeddings or using the [CLS] token). This approach loses some of the deep contextual information that BERT captures.
    • Traditional classifiers cannot effectively leverage the token-level contextual representations or the complex relationships between words.
  • Why Fine-Tuning Helps:

    • Fine-tuning transformers allows the model to dynamically adjust the embeddings based on the specific task, preserving and enhancing contextual relationships.

2. Inefficient Handling of Sequence-Level Features¶

  • BERT Embeddings Limitation:

    • Traditional classifiers do not inherently consider the sequential nature of text data, potentially missing patterns like:
      • Word order.
      • Semantic dependencies between tokens or phrases.
  • Why Fine-Tuning Helps:

    • Fine-tuned transformers use self-attention mechanisms to model relationships across the sequence, enabling a more nuanced understanding of the accident descriptions.

3. Limited Adaptation to Task-Specific Data¶

  • BERT Embeddings Limitation:

    • The embeddings are fixed during traditional classification, meaning they are not updated based on the accident severity task.
    • Domain-specific language nuances (e.g., technical accident terminology) are not learned.
  • Why Fine-Tuning Helps:

    • Fine-tuning adjusts the entire model, including embeddings, to the target dataset, enhancing the ability to capture task-specific patterns and terminology.

4. Class Imbalance Challenges¶

  • BERT Embeddings Limitation:

    • Traditional classifiers, especially tree-based models, can struggle with imbalanced datasets. They often bias predictions toward the majority class, even when using techniques like SMOTE or class weighting.
  • Why Fine-Tuning Helps:

    • Transformers can be fine-tuned with techniques like weighted loss functions or custom sampling during training, enabling better performance on minority classes (e.g., "High" severity accidents).

5. Scalability and Computational Overheads¶

  • BERT Embeddings Limitation:

    • Generating BERT embeddings for large datasets can be computationally expensive, especially if embeddings are re-computed for each inference or training instance.
  • Why Fine-Tuning Helps:

    • Although fine-tuning transformers is computationally intensive during training, it provides an end-to-end model that reduces the inference-time complexity compared to combining embeddings and external classifiers.

6. Lack of End-to-End Optimization¶

  • BERT Embeddings Limitation:

    • When using traditional classifiers, embeddings and classifiers are optimized independently. This decoupling can lead to suboptimal performance because:
      • Classifiers may not effectively leverage embeddings.
      • The embeddings are not tailored for the task.
  • Why Fine-Tuning Helps:

    • Fine-tuning provides a seamless end-to-end learning framework where the embeddings, transformer layers, and classification head are jointly optimized for the accident severity task.

7. Difficulty in Capturing Rare and Subtle Patterns¶

  • BERT Embeddings Limitation:

    • Traditional classifiers like Random Forest or Gradient Boosting rely on feature importance derived from embeddings. Rare patterns in accident descriptions (e.g., specific technical terms indicating "High" severity) may not be captured effectively.
  • Why Fine-Tuning Helps:

    • The self-attention mechanism in transformers excels at capturing subtle and long-range dependencies, making it easier to detect rare but critical patterns.

8. Impact on Fine-Grained Classifications¶

  • BERT Embeddings Limitation:

    • For multi-class problems like accident severity classification, traditional classifiers might struggle with fine-grained distinctions (e.g., "Medium" vs. "High").
  • Why Fine-Tuning Helps:

    • Fine-tuned transformers can leverage token-level distinctions and context to make precise predictions, resulting in improved performance for nuanced categories.

9. Handling Ambiguity in Descriptions¶

  • BERT Embeddings Limitation:

    • Accident descriptions can be ambiguous or vague. Traditional classifiers often rely on static features, leading to reduced robustness in such cases.
  • Why Fine-Tuning Helps:

    • Fine-tuned transformers adjust weights dynamically, learning how to disambiguate complex or multi-layered descriptions during training.

10. Lack of Model Interpretability¶

  • BERT Embeddings Limitation:

    • Tree-based classifiers like Random Forest offer feature importance, but these are often hard to map back to meaningful text-based features from embeddings.
  • Why Fine-Tuning Helps:

    • Techniques like attention visualization in transformers can provide insights into which parts of the text contribute most to the classification, improving interpretability.

Summary: Key Advantages of Fine-Tuning Over Embedding-Based Classification¶

Aspect BERT Embeddings + Classifiers Fine-Tuned Transformers
Contextual Understanding Limited to static embeddings Dynamic, task-specific contextualization
Handling Sequential Data Poor Excellent (self-attention)
Adaptation to Specific Data None High
Imbalanced Data Handling Difficult Better with weighted loss or sampling
Training Efficiency Faster Computationally intensive but optimized
Prediction Quality Decoupled; suboptimal End-to-end optimized
Interpretability Weak (via embeddings) Better (via attention mechanisms)

Recommendation¶

To improve accident severity classification:

  • Use fine-tuned RoBERTa or BERT like models as the primary model for sequence classification.
  • Reserve traditional classifiers for simpler tasks or as benchmarks for performance comparison. Fine-tuned transformers offer clear advantages in capturing context, handling imbalanced classes, and achieving higher accuracy and F1-scores.

Note: The classifiers are performing better on raw data. Therefore, we will try to create more models on raw data and check if we are getting improved performance

  • Preprocessing is not needed when using pre-trained language representation models like BERT. It uses all of the information in a sentence, punctuation and stop-words from a wide range of perspectives by leveraging a multi-head self attention mechanism.

Second Approach: Fine-Tuning Transformer Models for Sequence Classification¶

Here, a pre-trained model (like BERT, RoBERTa, DistilBERT, XLNet, or ALBERT) is fine-tuned specifically for accident level classification by adding a classification layer and optimizing the model end-to-end on the accident description data.

Steps:

  • Use a pre-trained model and add a classification layer.
  • Fine-tune the entire model, optimizing it for accident level classification on the specific dataset.

Advantages:

  • Task-Specific Optimization: Fine-tuning adjusts the model’s parameters specifically for accident classification, often resulting in higher accuracy than using general-purpose embeddings.
  • Superior Performance on Large Datasets: Transformer models can leverage large datasets well, extracting intricate patterns that are often crucial for text classification tasks.
  • Better Handling of Class Imbalances and Nuances: Models can better handle nuanced differences between classes when trained end-to-end, which is especially useful for tasks with a high degree of overlap in descriptions.

Contextual Understanding: By fine-tuning, the model learns to pay attention to specific accident-related language, providing a deeper and more context-aware representation.

Limitations:

  • Higher Computational Requirements: Fine-tuning transformers is computationally expensive, especially with large models, as it requires significant GPU resources.
  • Risk of Overfitting on Small Datasets: On small datasets, fine-tuning can lead to overfitting, especially if regularization techniques are not applied.

Longer Training Time: End-to-end training requires more time compared to using precomputed embeddings and classifiers.

  • Complexity in Hyperparameter Tuning: Fine-tuning involves several hyperparameters (e.g., learning rate, batch size, number of epochs) which must be tuned to avoid underfitting or overfitting.

Best Use Cases:

  • Medium to large datasets where computational resources are available.

When higher accuracy is crucial, and the subtle nuances in the descriptions are essential for classification.

  • When the dataset size supports fine-tuning without significant overfitting risks.

Using Models like BERT, Roberta, DistilBert, XLNet, Albert on Raw Data¶

Below is the brief description on why we selected these 5 models to classify accident levels and get improved model performance:

  1. BERT (Bidirectional Encoder Representations from Transformers): Good for general NLP classification tasks with context-rich descriptions. You can start with bert-base-uncased.
  2. DistilBERT: A smaller, faster variant of BERT with comparable accuracy for many tasks. If computational efficiency is a concern, distilbert-base-uncased is a good choice.
  3. RoBERTa: A robustly optimized variant of BERT with superior performance on longer sequences (roberta-base or roberta-large).
  4. XLNet: If the accident descriptions often include unusual phrasing or complex dependencies, XLNet may be more effective.
  5. ALBERT: Good for large datasets due to its efficiency and lower memory footprint, without sacrificing much accuracy.

Below is the overview of the steps we are likely to follow:

  1. Load a pre-trained model (e.g., BERT) and tokenize the accident descriptions using the corresponding tokenizer.
  2. Add a classification head (fully connected layer) for multi-class classification.
  3. Fine-tune the model on the dataset using the accident severity level as the target variable.
In [ ]:
# Mounting Google Drive
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
In [ ]:
# importing basic libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
In [ ]:
# Import the original excel data as dataframe using pandas
# data_original = pd.read_excel("Data+Set+-+industrial_safety_and_health_database_with_accidents_description.xlsx")
data_original = pd.read_excel('/content/drive/MyDrive/Great Learning/Capstone/Data+Set+-+industrial_safety_and_health_database_with_accidents_description.xlsx')
data_original.head()
Out[ ]:
Unnamed: 0 Data Countries Local Industry Sector Accident Level Potential Accident Level Genre Employee or Third Party Critical Risk Description
0 0 2016-01-01 Country_01 Local_01 Mining I IV Male Third Party Pressed While removing the drill rod of the Jumbo 08 f...
1 1 2016-01-02 Country_02 Local_02 Mining I IV Male Employee Pressurized Systems During the activation of a sodium sulphide pum...
2 2 2016-01-06 Country_01 Local_03 Mining I III Male Third Party (Remote) Manual Tools In the sub-station MILPO located at level +170...
3 3 2016-01-08 Country_01 Local_04 Mining I I Male Third Party Others Being 9:45 am. approximately in the Nv. 1880 C...
4 4 2016-01-10 Country_01 Local_04 Mining IV IV Male Third Party Others Approximately at 11:45 a.m. in circumstances t...

Import Required Libraries¶

In [ ]:
!pip install datasets
Collecting datasets
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from datasets) (3.16.1)
Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.10/dist-packages (from datasets) (1.26.4)
Requirement already satisfied: pyarrow>=15.0.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (17.0.0)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from datasets) (2.2.2)
Requirement already satisfied: requests>=2.32.2 in /usr/local/lib/python3.10/dist-packages (from datasets) (2.32.3)
Requirement already satisfied: tqdm>=4.66.3 in /usr/local/lib/python3.10/dist-packages (from datasets) (4.66.6)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Requirement already satisfied: aiohttp in /usr/local/lib/python3.10/dist-packages (from datasets) (3.10.10)
Requirement already satisfied: huggingface-hub>=0.23.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (0.24.7)
Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from datasets) (24.1)
Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from datasets) (6.0.2)
Requirement already satisfied: aiohappyeyeballs>=2.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (2.4.3)
Requirement already satisfied: aiosignal>=1.1.2 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.3.1)
Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (24.2.0)
Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.5.0)
Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (6.1.0)
Requirement already satisfied: yarl<2.0,>=1.12.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.17.1)
Requirement already satisfied: async-timeout<5.0,>=4.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (4.0.3)
Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.23.0->datasets) (4.12.2)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests>=2.32.2->datasets) (3.4.0)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests>=2.32.2->datasets) (3.10)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests>=2.32.2->datasets) (2.2.3)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests>=2.32.2->datasets) (2024.8.30)
Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets) (2024.2)
Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets) (2024.2)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.2->pandas->datasets) (1.16.0)
Requirement already satisfied: propcache>=0.2.0 in /usr/local/lib/python3.10/dist-packages (from yarl<2.0,>=1.12.0->aiohttp->datasets) (0.2.0)
Downloading datasets-3.1.0-py3-none-any.whl (480 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 480.6/480.6 kB 9.6 MB/s eta 0:00:00
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 116.3/116.3 kB 11.9 MB/s eta 0:00:00
Downloading fsspec-2024.9.0-py3-none-any.whl (179 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 179.3/179.3 kB 17.8 MB/s eta 0:00:00
Downloading multiprocess-0.70.16-py310-none-any.whl (134 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 134.8/134.8 kB 13.9 MB/s eta 0:00:00
Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 194.1/194.1 kB 20.0 MB/s eta 0:00:00
Installing collected packages: xxhash, fsspec, dill, multiprocess, datasets
  Attempting uninstall: fsspec
    Found existing installation: fsspec 2024.10.0
    Uninstalling fsspec-2024.10.0:
      Successfully uninstalled fsspec-2024.10.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gcsfs 2024.10.0 requires fsspec==2024.10.0, but you have fsspec 2024.9.0 which is incompatible.
Successfully installed datasets-3.1.0 dill-0.3.8 fsspec-2024.9.0 multiprocess-0.70.16 xxhash-3.5.0
In [ ]:
import torch
from transformers import (
    BertTokenizer, BertForSequenceClassification,
    RobertaTokenizer, RobertaForSequenceClassification,
    DistilBertTokenizer, DistilBertForSequenceClassification,
    XLNetTokenizer, XLNetForSequenceClassification,
    AlbertTokenizer, AlbertForSequenceClassification,
    Trainer, TrainingArguments
)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from datasets import Dataset
import pandas as pd

Load and Preprocess Data¶

In [ ]:
# Load data
df = data_original.copy()
df = df[['Description', 'Accident Level']]  # Columns with text and labels

# Encode labels
label_encoder = LabelEncoder()
df['labels'] = label_encoder.fit_transform(df['Accident Level'])

# Split data into train and test
train_texts, test_texts, train_labels, test_labels = train_test_split(
    df['Description'].tolist(), df['labels'].tolist(), test_size=0.2, random_state=42
)

Define Model Selection, Training Process, Run Each Model, Evaluate Performance and Compare Model Results.¶

In [ ]:
import warnings
import pandas as pd
from transformers import Trainer, TrainingArguments
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from datasets import Dataset

# Suppress all warnings
warnings.filterwarnings("ignore")

# Define a custom metric function to calculate accuracy, precision, recall, and F1
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)  # Get the index of the highest prediction score
    accuracy = accuracy_score(labels, preds)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average="weighted")
    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1,
    }

# Update training function to log training loss and evaluation accuracy, and store metrics for each epoch
def train_model(model_name, tokenizer_class, model_class, train_texts, train_labels, test_texts, test_labels, label_encoder):
    tokenizer = tokenizer_class.from_pretrained(model_name)
    model = model_class.from_pretrained(model_name, num_labels=len(label_encoder.classes_))

    # Tokenize the data
    train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=128)
    test_encodings = tokenizer(test_texts, truncation=True, padding=True, max_length=128)

    # Convert to Hugging Face Dataset format
    train_dataset = Dataset.from_dict({'input_ids': train_encodings['input_ids'], 'attention_mask': train_encodings['attention_mask'], 'labels': train_labels})
    test_dataset = Dataset.from_dict({'input_ids': test_encodings['input_ids'], 'attention_mask': test_encodings['attention_mask'], 'labels': test_labels})

    # Define training arguments with logging enabled
    training_args = TrainingArguments(
        output_dir='./results',
        evaluation_strategy="epoch",
        logging_strategy="steps",
        logging_steps=10,  # Log every 10 steps
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        num_train_epochs=3,
        weight_decay=0.01,
        logging_dir='./logs',
        save_strategy="no",


        # Added these lines
        # load_best_model_at_end=True, # loads the best model according to metric_for_best_model
        # metric_for_best_model="accuracy"

    )

    # Initialize Trainer with the custom metric function
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=test_dataset,
        compute_metrics=compute_metrics
    )

    # Train the model and log loss
    trainer.train()

    # Evaluate the model and return accuracy results for each epoch
    eval_results = trainer.evaluate()
    return eval_results, trainer

# Dictionary to store evaluation results for each model
model_results = {}

# Define models and tokenizers
models = {
    "BERT": (BertTokenizer, BertForSequenceClassification, 'bert-base-uncased'),
    "RoBERTa": (RobertaTokenizer, RobertaForSequenceClassification, 'roberta-base'),
    "DistilBERT": (DistilBertTokenizer, DistilBertForSequenceClassification, 'distilbert-base-uncased'),
    "XLNet": (XLNetTokenizer, XLNetForSequenceClassification, 'xlnet-base-cased'),
    "ALBERT": (AlbertTokenizer, AlbertForSequenceClassification, 'albert-base-v2')
}

# Dataframe to store final epoch metrics for each model
final_results_df = pd.DataFrame(columns=["Model", "Accuracy", "Precision", "Recall", "F1"])

# Train and evaluate each model
for model_name, (tokenizer_class, model_class, model_pretrained) in models.items():
    print(f"Training {model_name} model...")
    eval_results, trainer = train_model(model_pretrained, tokenizer_class, model_class, train_texts, train_labels, test_texts, test_labels, label_encoder)
    model_results[model_name] = eval_results

    # Extract metrics from the final epoch and store in summary DataFrame
    final_metrics = {
        "Model": model_name,
        "Accuracy": eval_results["eval_accuracy"],
        "Precision": eval_results["eval_precision"],
        "Recall": eval_results["eval_recall"],
        "F1": eval_results["eval_f1"]
    }
    # final_results_df = final_results_df.append(final_metrics, ignore_index=True)
    final_results_df = pd.concat([final_results_df, pd.DataFrame([final_metrics])], ignore_index=True)
    print(f"Finished training {model_name}.")
    print(f"Evaluation Results for {model_name}: {eval_results}")

# Sort results by Accuracy and F1 Score in descending order
final_results_accuracy = final_results_df.sort_values(by="Accuracy", ascending=False).reset_index(drop=True)
final_results_f1 = final_results_df.sort_values(by="F1", ascending=False).reset_index(drop=True)

# Display the sorted tables
print("\nFinal Results Sorted by Accuracy:\n", final_results_accuracy)
print("\nFinal Results Sorted by F1 Score:\n", final_results_f1)
Training BERT model...
tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]
vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]
tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]
config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]
model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
wandb: WARNING The `run_name` is currently set to the same value as `TrainingArguments.output_dir`. If this was not intended, please specify a different run name by setting the `TrainingArguments.run_name` parameter.
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:
 ··········
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
Tracking run with wandb version 0.18.5
Run data is saved locally in /content/wandb/run-20241109_040042-m52zoq4k
Syncing run ./results to Weights & Biases (docs)
View project at https://wandb.ai/animeshjohri18-london-stock-exchange-group/huggingface
View run at https://wandb.ai/animeshjohri18-london-stock-exchange-group/huggingface/runs/m52zoq4k
[129/129 00:07, Epoch 3/3]
Epoch Training Loss Validation Loss Accuracy Precision Recall F1
1 0.954600 0.790300 0.800000 0.640000 0.800000 0.711111
2 0.672800 0.749937 0.800000 0.640000 0.800000 0.711111
3 0.776000 0.791157 0.764706 0.641975 0.764706 0.697987

[11/11 00:00]
Finished training BERT.
Evaluation Results for BERT: {'eval_loss': 0.7911568284034729, 'eval_accuracy': 0.7647058823529411, 'eval_precision': 0.6419753086419753, 'eval_recall': 0.7647058823529411, 'eval_f1': 0.697986577181208, 'eval_runtime': 0.1858, 'eval_samples_per_second': 457.413, 'eval_steps_per_second': 59.195, 'epoch': 3.0}
Training RoBERTa model...
tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]
vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]
merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]
tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]
config.json:   0%|          | 0.00/481 [00:00<?, ?B/s]
model.safetensors:   0%|          | 0.00/499M [00:00<?, ?B/s]
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[129/129 00:08, Epoch 3/3]
Epoch Training Loss Validation Loss Accuracy Precision Recall F1
1 0.956200 0.772401 0.800000 0.640000 0.800000 0.711111
2 0.713700 0.796836 0.800000 0.640000 0.800000 0.711111
3 0.898600 0.761021 0.800000 0.640000 0.800000 0.711111

[11/11 00:00]
Finished training RoBERTa.
Evaluation Results for RoBERTa: {'eval_loss': 0.7610211372375488, 'eval_accuracy': 0.8, 'eval_precision': 0.64, 'eval_recall': 0.8, 'eval_f1': 0.7111111111111111, 'eval_runtime': 0.2072, 'eval_samples_per_second': 410.156, 'eval_steps_per_second': 53.079, 'epoch': 3.0}
Training DistilBERT model...
tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]
vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]
tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]
config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]
model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[129/129 00:04, Epoch 3/3]
Epoch Training Loss Validation Loss Accuracy Precision Recall F1
1 0.949700 0.770693 0.800000 0.640000 0.800000 0.711111
2 0.683800 0.775398 0.800000 0.640000 0.800000 0.711111
3 0.835000 0.771040 0.800000 0.640000 0.800000 0.711111

[11/11 00:00]
Finished training DistilBERT.
Evaluation Results for DistilBERT: {'eval_loss': 0.7710402607917786, 'eval_accuracy': 0.8, 'eval_precision': 0.64, 'eval_recall': 0.8, 'eval_f1': 0.7111111111111111, 'eval_runtime': 0.1309, 'eval_samples_per_second': 649.177, 'eval_steps_per_second': 84.011, 'epoch': 3.0}
Training XLNet model...
spiece.model:   0%|          | 0.00/798k [00:00<?, ?B/s]
tokenizer.json:   0%|          | 0.00/1.38M [00:00<?, ?B/s]
config.json:   0%|          | 0.00/760 [00:00<?, ?B/s]
pytorch_model.bin:   0%|          | 0.00/467M [00:00<?, ?B/s]
Some weights of XLNetForSequenceClassification were not initialized from the model checkpoint at xlnet-base-cased and are newly initialized: ['logits_proj.bias', 'logits_proj.weight', 'sequence_summary.summary.bias', 'sequence_summary.summary.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[129/129 00:10, Epoch 3/3]
Epoch Training Loss Validation Loss Accuracy Precision Recall F1
1 0.955400 0.768249 0.800000 0.640000 0.800000 0.711111
2 0.722200 0.808175 0.800000 0.640000 0.800000 0.711111
3 0.804900 0.733474 0.800000 0.640000 0.800000 0.711111

[11/11 00:00]
Finished training XLNet.
Evaluation Results for XLNet: {'eval_loss': 0.7334736585617065, 'eval_accuracy': 0.8, 'eval_precision': 0.64, 'eval_recall': 0.8, 'eval_f1': 0.7111111111111111, 'eval_runtime': 0.2529, 'eval_samples_per_second': 336.106, 'eval_steps_per_second': 43.496, 'epoch': 3.0}
Training ALBERT model...
tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]
spiece.model:   0%|          | 0.00/760k [00:00<?, ?B/s]
tokenizer.json:   0%|          | 0.00/1.31M [00:00<?, ?B/s]
config.json:   0%|          | 0.00/684 [00:00<?, ?B/s]
model.safetensors:   0%|          | 0.00/47.4M [00:00<?, ?B/s]
Some weights of AlbertForSequenceClassification were not initialized from the model checkpoint at albert-base-v2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[129/129 00:07, Epoch 3/3]
Epoch Training Loss Validation Loss Accuracy Precision Recall F1
1 0.973600 0.779485 0.800000 0.640000 0.800000 0.711111
2 0.726500 0.808522 0.800000 0.640000 0.800000 0.711111
3 0.855300 0.762096 0.800000 0.640000 0.800000 0.711111

[11/11 00:00]
Finished training ALBERT.
Evaluation Results for ALBERT: {'eval_loss': 0.762096107006073, 'eval_accuracy': 0.8, 'eval_precision': 0.64, 'eval_recall': 0.8, 'eval_f1': 0.7111111111111111, 'eval_runtime': 0.2124, 'eval_samples_per_second': 400.263, 'eval_steps_per_second': 51.799, 'epoch': 3.0}

Final Results Sorted by Accuracy:
         Model  Accuracy  Precision    Recall        F1
0     RoBERTa  0.800000   0.640000  0.800000  0.711111
1       XLNet  0.800000   0.640000  0.800000  0.711111
2  DistilBERT  0.800000   0.640000  0.800000  0.711111
3      ALBERT  0.800000   0.640000  0.800000  0.711111
4        BERT  0.764706   0.641975  0.764706  0.697987

Final Results Sorted by F1 Score:
         Model  Accuracy  Precision    Recall        F1
0     RoBERTa  0.800000   0.640000  0.800000  0.711111
1       XLNet  0.800000   0.640000  0.800000  0.711111
2  DistilBERT  0.800000   0.640000  0.800000  0.711111
3      ALBERT  0.800000   0.640000  0.800000  0.711111
4        BERT  0.764706   0.641975  0.764706  0.697987
Insights:¶
  • The best test accuracy(80%) and best test F1 score(71%) of RoBERTa model is similar to the first approach where BERT embeddings and traditional classifers were used where test accuracy was the in the range of 74-80% and test F-1 score falls in the range of 68-72%.

The above code is fine-tuning a pre-trained transformer model (like BERT, RoBERTa, DistilBERT, XLNet, or ALBERT) for accident level classification.

Code Analysis and Function¶
  1. Fine-Tuning with Custom Classification Layer:

    • Each transformer model (e.g., BERT, RoBERTa, etc.) is loaded using its pre-trained weights (e.g., 'bert-base-uncased' for BERT).
    • A custom classification layer with num_labels equal to the number of accident levels is added to the model. This layer is trained from scratch, adapting the model specifically for accident level classification.
  2. Tokenization and Dataset Preparation:

    • The accident descriptions are tokenized using the tokenizer corresponding to each model, truncating or padding as necessary to ensure a uniform input size.
    • The tokenized data is then converted into a format compatible with Hugging Face’s Dataset class, making it compatible with the Trainer.
  3. TrainingArguments for Fine-Tuning:

    • The TrainingArguments include multiple fine-tuning-specific configurations, such as evaluation_strategy="epoch", num_train_epochs=3, and weight_decay=0.01. These parameters are chosen for fine-tuning purposes rather than training from scratch.
    • Fine-tuning requires adjustments in model parameters and hyperparameters to optimize for the accident classification task specifically, such as learning_rate, batch_size, and num_train_epochs.
  4. Trainer and Training:

    • The Trainer is initialized with the model, training arguments, and a compute_metrics function to evaluate accuracy, precision, recall, and F1 score during training.
    • Fine-tuning occurs when trainer.train() is called. It updates not only the classification layer but also the transformer’s pre-trained layers, adapting the model for the accident classification task.
  5. Evaluation and Results Storage:

    • After training, the model is evaluated on the test dataset, and evaluation metrics from the final epoch are extracted and stored.
    • The final_results_df is created to summarize and compare model performance (sorted by accuracy and F1 score) across different transformer models.

Summary:¶

This Code Performs Fine-Tuning:

  • The code adapts pre-trained transformers to a new classification task (accident level prediction) by adding a task-specific classification head and adjusting model weights to improve performance on this particular task.
  • This approach leverages the existing language understanding in transformers, which has been acquired from extensive pre-training on large datasets.
  • Requires less data and computational resources than training from scratch.
  • Leads to better performance than using pre-computed embeddings, as the entire model is optimized for the task.
In [ ]:
# Final Results Sorted by Accuracy
final_results_accuracy
Out[ ]:
Model Accuracy Precision Recall F1
0 RoBERTa 0.800000 0.640000 0.800000 0.711111
1 XLNet 0.800000 0.640000 0.800000 0.711111
2 DistilBERT 0.800000 0.640000 0.800000 0.711111
3 ALBERT 0.800000 0.640000 0.800000 0.711111
4 BERT 0.764706 0.641975 0.764706 0.697987
In [ ]:
# Final Results Sorted by F1 score
final_results_f1
Out[ ]:
Model Accuracy Precision Recall F1
0 RoBERTa 0.800000 0.640000 0.800000 0.711111
1 XLNet 0.800000 0.640000 0.800000 0.711111
2 DistilBERT 0.800000 0.640000 0.800000 0.711111
3 ALBERT 0.800000 0.640000 0.800000 0.711111
4 BERT 0.764706 0.641975 0.764706 0.697987

Insights:

  • We are getting test accuracy in the range of 76-80% and test F1-score in the range of 69-71%.
  • The result is only giving test metrics

Exploring ways to get train and test metrics and trying to get better metrics after increasing the epochs to 10 in the below step:¶

In [ ]:
from transformers import TrainingArguments, Trainer
from datasets import Dataset
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import pandas as pd
import warnings
import torch

warnings.filterwarnings("ignore")

# Function to compute metrics
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    accuracy = accuracy_score(labels, preds)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='weighted')
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1
    }

# Initialize a list to store results for each model
model_results = []

# Training function for each model
def train_and_evaluate_model(model_pretrained, tokenizer_class, model_class, train_texts, train_labels, test_texts, test_labels):
    tokenizer = tokenizer_class.from_pretrained(model_pretrained)
    model = model_class.from_pretrained(model_pretrained, num_labels=len(set(train_labels)))

    # Move all model parameters to contiguous memory if necessary
    for param in model.parameters():
        if not param.is_contiguous():
            param.data = param.data.contiguous()

    # Tokenize the datasets
    train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=512)
    test_encodings = tokenizer(test_texts, truncation=True, padding=True, max_length=512)

    # Convert tokenized data to Dataset format
    train_dataset = Dataset.from_dict({
        "input_ids": train_encodings["input_ids"],
        "attention_mask": train_encodings["attention_mask"],
        "labels": train_labels
    })
    test_dataset = Dataset.from_dict({
        "input_ids": test_encodings["input_ids"],
        "attention_mask": test_encodings["attention_mask"],
        "labels": test_labels
    })

    # Define training arguments
    training_args = TrainingArguments(
        output_dir='./results',
        num_train_epochs=10,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        warmup_steps=500,
        weight_decay=0.01,
        logging_dir='./logs',
        logging_steps=10,
        evaluation_strategy="epoch"
    )

    # Define Trainer with custom training evaluation to get train metrics
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=test_dataset,
        compute_metrics=compute_metrics
    )

    # Train and evaluate
    trainer.train()
    # Explicitly evaluate on training data for metrics
    train_metrics = trainer.evaluate(train_dataset)
    eval_metrics = trainer.evaluate(test_dataset)

    return train_metrics, eval_metrics

# Define models and tokenizers
models = {
    "BERT": (BertTokenizer, BertForSequenceClassification, 'bert-base-uncased'),
    "RoBERTa": (RobertaTokenizer, RobertaForSequenceClassification, 'roberta-base'),
    "DistilBERT": (DistilBertTokenizer, DistilBertForSequenceClassification, 'distilbert-base-uncased'),
    "XLNet": (XLNetTokenizer, XLNetForSequenceClassification, 'xlnet-base-cased'),
    "ALBERT": (AlbertTokenizer, AlbertForSequenceClassification, 'albert-base-v2')
}

# Loop through models
for model_name, (tokenizer_class, model_class, model_pretrained) in models.items():
    print(f"Training {model_name}...")
    train_metrics, eval_metrics = train_and_evaluate_model(model_pretrained, tokenizer_class, model_class, train_texts, train_labels, test_texts, test_labels)

    # Append the last epoch's metrics for each model to model_results list
    model_results.append({
        "Model": model_name,
        "Train Accuracy": train_metrics.get("eval_accuracy", 0),
        "Validation Accuracy": eval_metrics.get("eval_accuracy", 0),
        "Train Precision": train_metrics.get("eval_precision", 0),
        "Validation Precision": eval_metrics.get("eval_precision", 0),
        "Train Recall": train_metrics.get("eval_recall", 0),
        "Validation Recall": eval_metrics.get("eval_recall", 0),
        "Train F1": train_metrics.get("eval_f1", 0),
        "Validation F1": eval_metrics.get("eval_f1", 0),
    })

# Convert results to DataFrame and display sorted tables
results_df = pd.DataFrame(model_results)

# Display summary tables
print("Summary Table - Ordered by Validation Accuracy (Descending):")
print(results_df.sort_values(by="Validation Accuracy", ascending=False).to_string(index=False))

print("\nSummary Table - Ordered by Validation F1 Score (Descending):")
print(results_df.sort_values(by="Validation F1", ascending=False).to_string(index=False))
Training BERT...
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[430/430 00:41, Epoch 10/10]
Epoch Training Loss Validation Loss Accuracy Precision Recall F1
1 1.119300 0.969382 0.800000 0.640000 0.800000 0.711111
2 0.782600 0.783712 0.800000 0.640000 0.800000 0.711111
3 0.890900 0.805560 0.800000 0.640000 0.800000 0.711111
4 0.823600 0.728813 0.800000 0.640000 0.800000 0.711111
5 0.796600 0.813704 0.776471 0.679641 0.776471 0.724306
6 0.680100 0.785266 0.729412 0.704342 0.729412 0.716431
7 0.436700 0.808377 0.788235 0.733333 0.788235 0.741348
8 0.477600 0.921163 0.776471 0.660000 0.776471 0.713514
9 0.344100 1.252109 0.705882 0.654418 0.705882 0.678915
10 0.287100 1.257946 0.800000 0.647619 0.800000 0.715789

[43/43 00:01]
Training RoBERTa...
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[430/430 00:45, Epoch 10/10]
Epoch Training Loss Validation Loss Accuracy Precision Recall F1
1 1.439200 1.310259 0.800000 0.640000 0.800000 0.711111
2 0.759800 0.785657 0.800000 0.640000 0.800000 0.711111
3 0.869900 0.766365 0.800000 0.640000 0.800000 0.711111
4 0.866900 0.769237 0.800000 0.640000 0.800000 0.711111
5 0.851700 0.762193 0.800000 0.640000 0.800000 0.711111
6 0.734900 0.891209 0.694118 0.664789 0.694118 0.679137
7 0.553900 0.880783 0.682353 0.717647 0.682353 0.689412
8 0.553800 1.259274 0.682353 0.653521 0.682353 0.667626
9 0.495300 1.482099 0.623529 0.684706 0.623529 0.645343
10 0.224400 1.618874 0.658824 0.695343 0.658824 0.676078

[43/43 00:01]
Training DistilBERT...
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[430/430 00:24, Epoch 10/10]
Epoch Training Loss Validation Loss Accuracy Precision Recall F1
1 1.358000 1.196756 0.800000 0.640000 0.800000 0.711111
2 0.783100 0.771782 0.800000 0.640000 0.800000 0.711111
3 0.889200 0.752803 0.800000 0.640000 0.800000 0.711111
4 0.809300 0.765685 0.800000 0.640000 0.800000 0.711111
5 0.717000 0.764758 0.800000 0.640000 0.800000 0.711111
6 0.685600 0.801644 0.788235 0.663617 0.788235 0.720490
7 0.404100 1.021075 0.717647 0.668493 0.717647 0.692199
8 0.413800 1.242371 0.600000 0.697150 0.600000 0.640632
9 0.256800 1.476683 0.564706 0.668908 0.564706 0.609235
10 0.194500 1.912631 0.517647 0.694342 0.517647 0.586964

[43/43 00:00]
Training XLNet...
Some weights of XLNetForSequenceClassification were not initialized from the model checkpoint at xlnet-base-cased and are newly initialized: ['logits_proj.bias', 'logits_proj.weight', 'sequence_summary.summary.bias', 'sequence_summary.summary.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[430/430 00:56, Epoch 10/10]
Epoch Training Loss Validation Loss Accuracy Precision Recall F1
1 1.056800 0.797825 0.800000 0.640000 0.800000 0.711111
2 0.724400 0.752655 0.800000 0.640000 0.800000 0.711111
3 0.840100 0.734160 0.800000 0.640000 0.800000 0.711111
4 0.725900 0.697344 0.800000 0.640000 0.800000 0.711111
5 0.728300 0.732013 0.788235 0.645783 0.788235 0.709934
6 0.591000 0.765854 0.764706 0.675019 0.764706 0.717067
7 0.501600 0.995020 0.800000 0.647619 0.800000 0.715789
8 0.454200 1.064920 0.752941 0.678140 0.752941 0.713314
9 0.463900 1.259467 0.800000 0.647619 0.800000 0.715789
10 0.299100 1.911444 0.564706 0.714751 0.564706 0.624193

[43/43 00:03]
Training ALBERT...
Some weights of AlbertForSequenceClassification were not initialized from the model checkpoint at albert-base-v2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[430/430 00:41, Epoch 10/10]
Epoch Training Loss Validation Loss Accuracy Precision Recall F1
1 1.150000 0.861627 0.800000 0.640000 0.800000 0.711111
2 0.744300 0.768885 0.800000 0.640000 0.800000 0.711111
3 0.873900 0.761136 0.800000 0.640000 0.800000 0.711111
4 0.821700 0.721361 0.800000 0.640000 0.800000 0.711111
5 0.592000 0.723359 0.788235 0.645783 0.788235 0.709934
6 0.723500 0.772278 0.752941 0.720157 0.752941 0.724295
7 0.480600 0.955723 0.800000 0.640000 0.800000 0.711111
8 0.599800 1.009586 0.658824 0.690596 0.658824 0.671569
9 0.860400 0.991177 0.623529 0.657703 0.623529 0.637018
10 0.392600 0.845117 0.717647 0.684370 0.717647 0.698176

[43/43 00:01]
Summary Table - Ordered by Validation Accuracy (Descending):
     Model  Train Accuracy  Validation Accuracy  Train Precision  Validation Precision  Train Recall  Validation Recall  Train F1  Validation F1
      BERT        0.938235             0.800000         0.920557              0.647619      0.938235           0.800000  0.928354       0.715789
    ALBERT        0.911765             0.717647         0.899979              0.684370      0.911765           0.717647  0.897419       0.698176
   RoBERTa        0.908824             0.658824         0.926414              0.695343      0.908824           0.658824  0.896636       0.676078
     XLNet        0.905882             0.564706         0.932250              0.714751      0.905882           0.564706  0.911145       0.624193
DistilBERT        0.941176             0.517647         0.950096              0.694342      0.941176           0.517647  0.938637       0.586964

Summary Table - Ordered by Validation F1 Score (Descending):
     Model  Train Accuracy  Validation Accuracy  Train Precision  Validation Precision  Train Recall  Validation Recall  Train F1  Validation F1
      BERT        0.938235             0.800000         0.920557              0.647619      0.938235           0.800000  0.928354       0.715789
    ALBERT        0.911765             0.717647         0.899979              0.684370      0.911765           0.717647  0.897419       0.698176
   RoBERTa        0.908824             0.658824         0.926414              0.695343      0.908824           0.658824  0.896636       0.676078
     XLNet        0.905882             0.564706         0.932250              0.714751      0.905882           0.564706  0.911145       0.624193
DistilBERT        0.941176             0.517647         0.950096              0.694342      0.941176           0.517647  0.938637       0.586964
Insights:¶
  • After increasing the epochs we see that there is a significant difference between the training and testing metrics.
  • When the epochs are increased then the model gets trained too well on the training data and might not generalize similarly well on the testing data.
  • The test metrics are similar to the previous run of the models when 3 epochs were chosen.

Code Explanation: This code is performing fine-tuning of multiple pre-trained transformer models (BERT, RoBERTa, DistilBERT, XLNet, ALBERT) for a classification task, using a custom evaluation function to calculate metrics (accuracy, precision, recall, F1-score) on training and validation sets. Here’s a detailed explanation of the code:

Steps in Detail¶
  1. Imports and Initial Setup:

    • The necessary libraries are imported, including TrainingArguments and Trainer from Hugging Face’s transformers library, and other supporting libraries like Dataset from datasets for handling data, and metrics functions from sklearn.
    • A compute_metrics function is defined to calculate classification metrics, which will be passed to the Trainer.
  2. Define compute_metrics Function:

    • This function takes in predictions (pred) and extracts the predicted and true labels.
    • It computes accuracy, precision, recall, and F1-score using accuracy_score and precision_recall_fscore_support from sklearn, returning a dictionary of these metrics.
  3. Initialize model_results List:

    • An empty list, model_results, is created to store evaluation metrics for each model.
  4. train_and_evaluate_model Function:

    • Load Pre-trained Tokenizer and Model:
      • The function initializes the tokenizer and model for the specified model_pretrained. The classification layer is set to match the number of unique labels (num_labels).
    • Ensure Contiguous Memory:
      • For each parameter, param.data.contiguous() is called to ensure parameters are stored in contiguous memory, helping avoid memory errors.
    • Tokenize Text Data:
      • Training and testing data are tokenized, with padding and truncation applied to maintain uniform input length (max length 512 tokens).
    • Convert Data to Hugging Face Dataset:
      • The tokenized data is converted into Dataset objects compatible with Trainer.
    • Define Training Arguments:
      • TrainingArguments are set up with options for batch sizes, warmup steps, weight decay, logging, evaluation strategy, etc.
      • evaluation_strategy="epoch" specifies that evaluation on the test dataset should occur after each epoch.
    • Initialize Trainer:
      • A Trainer instance is created with the model, training arguments, datasets, and custom compute_metrics function.
    • Train and Evaluate Model:
      • The model is fine-tuned with trainer.train().
      • Explicit evaluations on the training and validation datasets are performed using trainer.evaluate() to capture metrics at the last epoch, returned as train_metrics and eval_metrics.
  5. Define Transformer Models and Tokenizers:

    • A dictionary, models, maps model names to their respective tokenizer and model classes, along with model checkpoints (pre-trained configurations).
  6. Train and Evaluate Each Model in Loop:

    • A loop iterates over each model in models, calling train_and_evaluate_model to fine-tune and evaluate each one.
    • For each model, metrics for training and validation sets are appended to model_results, including accuracy, precision, recall, and F1-score.
  7. Create and Display Summary Tables:

    • The model_results list is converted to a DataFrame called results_df.
    • Two summary tables are printed:
      • Sorted by Validation Accuracy: Shows models ranked by descending validation accuracy.
      • Sorted by Validation F1 Score: Shows models ranked by descending validation F1-score.

Summary¶

  • Fine-tuning: Each pre-trained model is fine-tuned on the training data, optimizing both the pre-trained layers and the classification layer for the specific task.
  • Evaluation: Metrics for training and validation sets are computed and displayed for each model, enabling comparison based on accuracy and F1-score.
  • Custom Metric Calculation: The compute_metrics function ensures each model is evaluated in terms of weighted accuracy, precision, recall, and F1-score, providing a comprehensive performance overview across classes.

This code effectively evaluates different transformer models on a classification task and allows for easy comparison of their performance based on key metrics.

Modified the above code to pick best model based on F1 score in the below steps:¶

In [ ]:
from transformers import TrainingArguments, Trainer, EarlyStoppingCallback
from datasets import Dataset
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import pandas as pd
import warnings
import torch

warnings.filterwarnings("ignore")

# Function to compute metrics
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    accuracy = accuracy_score(labels, preds)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='weighted')
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1
    }

# Initialize a list to store results for each model
model_results = []

# Training function for each model
def train_and_evaluate_model(model_pretrained, tokenizer_class, model_class, train_texts, train_labels, test_texts, test_labels):
    tokenizer = tokenizer_class.from_pretrained(model_pretrained)
    model = model_class.from_pretrained(model_pretrained, num_labels=len(set(train_labels)))

    # Ensure all model parameters are in contiguous memory if necessary
    for param in model.parameters():
        if not param.is_contiguous():
            param.data = param.data.contiguous()

    # Tokenize datasets
    train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=512)
    test_encodings = tokenizer(test_texts, truncation=True, padding=True, max_length=512)

    # Convert tokenized data to Dataset format
    train_dataset = Dataset.from_dict({
        "input_ids": train_encodings["input_ids"],
        "attention_mask": train_encodings["attention_mask"],
        "labels": train_labels
    })
    test_dataset = Dataset.from_dict({
        "input_ids": test_encodings["input_ids"],
        "attention_mask": test_encodings["attention_mask"],
        "labels": test_labels
    })

    # Define training arguments to optimize for F1
    training_args = TrainingArguments(
        output_dir='./results',
        num_train_epochs=5,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="f1", # optimizing and picking the models for best f-1 score
        greater_is_better=True,
        warmup_steps=500,
        weight_decay=0.01,
        logging_dir='./logs',
        logging_steps=10,
        report_to="none"
    )

    # Initialize the Trainer with EarlyStoppingCallback
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=test_dataset,
        compute_metrics=compute_metrics,
        callbacks=[EarlyStoppingCallback(early_stopping_patience=3)]
    )

    # Train and evaluate
    trainer.train()
    # Evaluate for training and validation metrics
    train_metrics = trainer.evaluate(train_dataset)
    eval_metrics = trainer.evaluate(test_dataset)

    return train_metrics, eval_metrics

# Define models and tokenizers
models = {
    "BERT": (BertTokenizer, BertForSequenceClassification, 'bert-base-uncased'),
    "RoBERTa": (RobertaTokenizer, RobertaForSequenceClassification, 'roberta-base'),
    "DistilBERT": (DistilBertTokenizer, DistilBertForSequenceClassification, 'distilbert-base-uncased'),
    "XLNet": (XLNetTokenizer, XLNetForSequenceClassification, 'xlnet-base-cased'),
    "ALBERT": (AlbertTokenizer, AlbertForSequenceClassification, 'albert-base-v2')
}

# Loop through models and collect results
for model_name, (tokenizer_class, model_class, model_pretrained) in models.items():
    print(f"Training {model_name}...")
    train_metrics, eval_metrics = train_and_evaluate_model(model_pretrained, tokenizer_class, model_class, train_texts, train_labels, test_texts, test_labels)

    # Append the last epoch's metrics for each model to model_results list
    model_results.append({
        "Model": model_name,
        "Train Accuracy": train_metrics.get("eval_accuracy", 0),
        "Validation Accuracy": eval_metrics.get("eval_accuracy", 0),
        "Train Precision": train_metrics.get("eval_precision", 0),
        "Validation Precision": eval_metrics.get("eval_precision", 0),
        "Train Recall": train_metrics.get("eval_recall", 0),
        "Validation Recall": eval_metrics.get("eval_recall", 0),
        "Train F1": train_metrics.get("eval_f1", 0),
        "Validation F1": eval_metrics.get("eval_f1", 0),
    })

# Convert results to DataFrame and display sorted tables
results_df = pd.DataFrame(model_results)

# Display summary tables
print("Summary Table - Ordered by Validation Accuracy (Descending):")
print(results_df.sort_values(by="Validation Accuracy", ascending=False).to_string(index=False))

print("\nSummary Table - Ordered by Validation F1 Score (Descending):")
print(results_df.sort_values(by="Validation F1", ascending=False).to_string(index=False))
Training BERT...
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[215/215 00:37, Epoch 5/5]
Epoch Training Loss Validation Loss Accuracy Precision Recall F1
1 1.444200 1.283723 0.729412 0.642590 0.729412 0.683187
2 0.861600 0.816862 0.800000 0.640000 0.800000 0.711111
3 0.897100 0.773485 0.800000 0.640000 0.800000 0.711111
4 0.800800 0.822465 0.800000 0.640000 0.800000 0.711111
5 0.685200 0.796234 0.764706 0.634146 0.764706 0.693333

[43/43 00:01]
Training RoBERTa...
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[172/215 00:30 < 00:07, 5.63 it/s, Epoch 4/5]
Epoch Training Loss Validation Loss Accuracy Precision Recall F1
1 1.457800 1.340572 0.800000 0.640000 0.800000 0.711111
2 0.714800 0.806190 0.800000 0.640000 0.800000 0.711111
3 0.881700 0.751741 0.800000 0.640000 0.800000 0.711111
4 0.822300 0.727247 0.800000 0.640000 0.800000 0.711111

[43/43 00:01]
Training DistilBERT...
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[172/215 00:20 < 00:05, 8.37 it/s, Epoch 4/5]
Epoch Training Loss Validation Loss Accuracy Precision Recall F1
1 1.420000 1.302629 0.800000 0.640000 0.800000 0.711111
2 0.815800 0.781738 0.800000 0.640000 0.800000 0.711111
3 0.890600 0.766387 0.800000 0.640000 0.800000 0.711111
4 0.828800 0.769921 0.800000 0.640000 0.800000 0.711111

[43/43 00:00]
Training XLNet...
Some weights of XLNetForSequenceClassification were not initialized from the model checkpoint at xlnet-base-cased and are newly initialized: ['logits_proj.bias', 'logits_proj.weight', 'sequence_summary.summary.bias', 'sequence_summary.summary.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[172/215 00:32 < 00:08, 5.17 it/s, Epoch 4/5]
Epoch Training Loss Validation Loss Accuracy Precision Recall F1
1 1.002400 0.766483 0.800000 0.640000 0.800000 0.711111
2 0.701100 0.763081 0.800000 0.640000 0.800000 0.711111
3 0.862500 0.757904 0.800000 0.640000 0.800000 0.711111
4 0.676300 0.714607 0.788235 0.638095 0.788235 0.705263

[43/43 00:02]
Training ALBERT...
Some weights of AlbertForSequenceClassification were not initialized from the model checkpoint at albert-base-v2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[172/215 00:18 < 00:04, 9.13 it/s, Epoch 4/5]
Epoch Training Loss Validation Loss Accuracy Precision Recall F1
1 1.185000 0.974409 0.800000 0.640000 0.800000 0.711111
2 0.737400 0.760119 0.800000 0.640000 0.800000 0.711111
3 0.818800 0.777122 0.800000 0.640000 0.800000 0.711111
4 0.761500 0.840861 0.800000 0.640000 0.800000 0.711111

[43/43 00:01]
Summary Table - Ordered by Validation Accuracy (Descending):
     Model  Train Accuracy  Validation Accuracy  Train Precision  Validation Precision  Train Recall  Validation Recall  Train F1  Validation F1
      BERT        0.729412                  0.8         0.532042                  0.64      0.729412                0.8  0.615286       0.711111
   RoBERTa        0.729412                  0.8         0.532042                  0.64      0.729412                0.8  0.615286       0.711111
DistilBERT        0.729412                  0.8         0.532042                  0.64      0.729412                0.8  0.615286       0.711111
     XLNet        0.729412                  0.8         0.532042                  0.64      0.729412                0.8  0.615286       0.711111
    ALBERT        0.729412                  0.8         0.532042                  0.64      0.729412                0.8  0.615286       0.711111

Summary Table - Ordered by Validation F1 Score (Descending):
     Model  Train Accuracy  Validation Accuracy  Train Precision  Validation Precision  Train Recall  Validation Recall  Train F1  Validation F1
      BERT        0.729412                  0.8         0.532042                  0.64      0.729412                0.8  0.615286       0.711111
   RoBERTa        0.729412                  0.8         0.532042                  0.64      0.729412                0.8  0.615286       0.711111
DistilBERT        0.729412                  0.8         0.532042                  0.64      0.729412                0.8  0.615286       0.711111
     XLNet        0.729412                  0.8         0.532042                  0.64      0.729412                0.8  0.615286       0.711111
    ALBERT        0.729412                  0.8         0.532042                  0.64      0.729412                0.8  0.615286       0.711111
Insights:¶

The test metrics are similar to the previous outputs. The validation/Test metrics are better compared to the train metrics because of the following reasons:

  • Data Imbalance: If the training data has a higher proportion of a certain class or is less diverse, the model might underperform during training but generalize better on a more balanced or diverse validation set.
  • Regularization Effects: Regularization techniques like dropout or weight decay may help the model generalize better on the validation set, even if it struggles to achieve higher accuracy during training.
  • Overfitting Prevention: If the model is not fully fitting to the training data (e.g., due to early stopping or fewer epochs), it might actually perform better on the validation set due to an implicit regularization effect.

This above code fine-tunes several pre-trained transformer models (e.g., BERT, RoBERTa, DistilBERT, XLNet, and ALBERT) on a classification task. The training is structured to prioritize optimizing the F1 score by setting it as the metric for selecting the best model and using early stopping to prevent overfitting. Here’s a breakdown of the steps in detail:

Steps in Detail¶
  1. Imports and Initial Setup:

    • Importing necessary libraries from transformers (for setting up training arguments and trainers) and datasets (for handling dataset creation). sklearn is used for computing performance metrics, while torch enables working with tensors if needed. Warnings are suppressed to reduce unnecessary output.
  2. Define compute_metrics Function:

    • This function is used to compute metrics during evaluation. It extracts labels and preds (predicted classes) from the input pred.
    • The accuracy, precision, recall, and f1 scores are calculated using sklearn's functions, with a weighted average applied to handle imbalanced data.
    • The function returns a dictionary containing these metrics, which will be used to assess model performance after each epoch.
  3. Initialize model_results List:

    • This empty list will store results for each model trained, helping to track metrics like accuracy and F1-score across models.
  4. train_and_evaluate_model Function:

    • Load Pre-trained Tokenizer and Model:
      • A tokenizer and model for the given model_pretrained are loaded, configured with the correct number of output labels based on the training data’s label count.
    • Ensure Contiguous Memory:
      • This step verifies that model parameters are stored in contiguous memory locations to avoid memory errors.
    • Tokenize Datasets:
      • The training and testing text data are tokenized, with truncation and padding to standardize inputs to a maximum length of 512 tokens.
    • Convert Tokenized Data to Dataset Format:
      • The tokenized data is converted into a Dataset format that’s compatible with the Trainer.
  5. Define TrainingArguments:

    • This is a crucial part of the code where training behavior and configuration are set:
      • output_dir='./results': Specifies the directory for saving model checkpoints.
      • num_train_epochs=5: Sets the number of epochs for training.
      • per_device_train_batch_size=8 and per_device_eval_batch_size=8: Defines batch sizes for training and evaluation.
      • evaluation_strategy="epoch": Instructs the Trainer to evaluate the model on the validation set at the end of each epoch.
      • save_strategy="epoch": Saves the model’s checkpoints after each epoch, making it possible to restore the model at any checkpoint.
      • load_best_model_at_end=True: Ensures that, after training, the best model according to the specified metric (in this case, F1 score) is loaded.
      • metric_for_best_model="f1": Sets F1 score as the target metric for model selection, meaning that the model with the highest F1 score will be chosen.
      • greater_is_better=True: Indicates that a higher F1 score is preferred for selecting the best model.
      • warmup_steps=500: Sets the number of warmup steps to gradually increase the learning rate at the start of training.
      • weight_decay=0.01: Applies a weight decay (L2 regularization) to help prevent overfitting.
      • logging_dir='./logs': Specifies the directory for logging training information.
      • logging_steps=10: Logs the training progress every 10 steps.
      • report_to="none": Prevents logging to external tools like TensorBoard; logs are instead kept within the script or Jupyter environment.
  6. Initialize the Trainer with Early Stopping:

    • A Trainer instance is created, responsible for managing training and evaluation.
    • The EarlyStoppingCallback is set with early_stopping_patience=3, which stops training if the F1 score does not improve for three consecutive evaluations (i.e., epochs), helping to avoid overfitting.
  7. Training and Evaluation:

    • Training: The trainer.train() call fine-tunes the model on the training data.
    • Evaluation: The model is evaluated separately on both the training and validation datasets after training completes. Metrics from these evaluations are stored in train_metrics and eval_metrics.
  8. Model Selection and Result Storage:

    • For each model, train_and_evaluate_model returns training and validation metrics, which are stored in model_results.
    • The last epoch’s metrics for each model are stored in model_results, specifically tracking accuracy, precision, recall, and F1-score on both training and validation sets.
  9. Results Conversion and Display:

    • model_results is converted to a DataFrame (results_df), allowing easy sorting and comparison of results.
    • The code then prints two summary tables:
      • Sorted by Validation Accuracy: Ranks models by descending accuracy on the validation set.
      • Sorted by Validation F1 Score: Ranks models by descending F1-score on the validation set, with F1 being the primary optimization metric.

F1 Score Improvement with Each Epoch:¶

The training process is structured to prioritize improving the F1 score because:

  • metric_for_best_model is set to "f1", so the best model is determined based on F1 score.
  • EarlyStoppingCallback monitors F1 score on the validation set, stopping training if there’s no improvement for 3 epochs.

The setup aims to improve F1 score specifically, though other metrics are tracked. In each epoch, the validation F1 score is recalculated, and if there is an improvement, the best-performing model is saved.

Balancing by Random Over Sampler and running the models¶

In [ ]:
# Install required libraries
!pip install transformers datasets torch scikit-learn imbalanced-learn

from transformers import (
    BertTokenizer, BertForSequenceClassification,
    RobertaTokenizer, RobertaForSequenceClassification,
    DistilBertTokenizer, DistilBertForSequenceClassification,
    XLNetTokenizer, XLNetForSequenceClassification,
    AlbertTokenizer, AlbertForSequenceClassification,
    Trainer, TrainingArguments, EarlyStoppingCallback
)
from datasets import Dataset
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler
import pandas as pd
import warnings
import torch

# Suppress all warnings
warnings.filterwarnings("ignore")

# Load data
df = data_original.copy()
df = df[['Description', 'Accident Level']]  # Columns with text and labels

# Encode labels to numeric format
label_encoder = LabelEncoder()
df['labels'] = label_encoder.fit_transform(df['Accident Level'])

# Split data into training and testing sets
train_texts, test_texts, train_labels, test_labels = train_test_split(
    df['Description'].tolist(),
    df['labels'].tolist(),
    test_size=0.2,
    random_state=42,
    stratify=df['labels']  # Ensures proportional representation in splits
)

# Initialize RandomOverSampler
ros = RandomOverSampler(random_state=42)

# Reshape train_texts for oversampling (required by RandomOverSampler)
train_texts_df = pd.DataFrame(train_texts, columns=['Description'])

# Apply RandomOverSampler to training data
train_texts_resampled, train_labels_resampled = ros.fit_resample(train_texts_df, train_labels)

# Convert back to lists
train_texts_resampled = train_texts_resampled['Description'].tolist()

# Optional: Verify the class distribution after oversampling
from collections import Counter
print("Class distribution after oversampling:", Counter(train_labels_resampled))

# Function to compute metrics
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    accuracy = accuracy_score(labels, preds)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average='weighted'
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1
    }

# Training and evaluation function
def train_and_evaluate_model(model_pretrained, tokenizer_class, model_class, train_texts, train_labels, test_texts, test_labels):
    # Initialize tokenizer and model
    tokenizer = tokenizer_class.from_pretrained(model_pretrained)
    model = model_class.from_pretrained(model_pretrained, num_labels=len(set(train_labels)))

    # Ensure all model parameters are contiguous in memory
    for param in model.parameters():
        if not param.is_contiguous():
            param.data = param.data.contiguous()

    # Tokenize the training and testing data
    train_encodings = tokenizer(
        train_texts, truncation=True, padding=True, max_length=512
    )
    test_encodings = tokenizer(
        test_texts, truncation=True, padding=True, max_length=512
    )

    # Convert tokenized data to Hugging Face Dataset format
    train_dataset = Dataset.from_dict({
        "input_ids": train_encodings["input_ids"],
        "attention_mask": train_encodings["attention_mask"],
        "labels": train_labels
    })
    test_dataset = Dataset.from_dict({
        "input_ids": test_encodings["input_ids"],
        "attention_mask": test_encodings["attention_mask"],
        "labels": test_labels
    })

    # Define training arguments to optimize for F1 Score
    training_args = TrainingArguments(
        output_dir='./results',
        num_train_epochs=10,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="f1",  # Optimize based on F1 score
        greater_is_better=True,
        warmup_steps=500,
        weight_decay=0.01,
        logging_dir='./logs',
        logging_steps=10,
        report_to="none"  # Disable reporting to external services
    )

    # Initialize Trainer with EarlyStoppingCallback
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=test_dataset,
        compute_metrics=compute_metrics,
        callbacks=[EarlyStoppingCallback(early_stopping_patience=3)]
    )

    # Train the model
    trainer.train()

    # Evaluate on training data
    train_metrics = trainer.evaluate(train_dataset)

    # Evaluate on validation/test data
    eval_metrics = trainer.evaluate(test_dataset)

    # Compute predictions on the train set for classification report
    train_predictions = trainer.predict(train_dataset)
    train_preds = train_predictions.predictions.argmax(-1)
    train_report = classification_report(train_labels, train_preds, target_names=label_encoder.classes_)

    # Compute predictions on the test set for classification report
    test_predictions = trainer.predict(test_dataset)
    test_preds = test_predictions.predictions.argmax(-1)
    test_report = classification_report(test_labels, test_preds, target_names=label_encoder.classes_)

    return train_metrics, eval_metrics, train_report, test_report

# Define models and tokenizers
models = {
    "BERT": (BertTokenizer, BertForSequenceClassification, 'bert-base-uncased'),
    "RoBERTa": (RobertaTokenizer, RobertaForSequenceClassification, 'roberta-base'),
    "DistilBERT": (DistilBertTokenizer, DistilBertForSequenceClassification, 'distilbert-base-uncased'),
    "XLNet": (XLNetTokenizer, XLNetForSequenceClassification, 'xlnet-base-cased'),
    "ALBERT": (AlbertTokenizer, AlbertForSequenceClassification, 'albert-base-v2')
}

# Initialize a list to store results for each model
model_results = []

# Loop through each model, train, evaluate, and collect metrics
for model_name, (tokenizer_class, model_class, model_pretrained) in models.items():
    print(f"Training {model_name}...")
    train_metrics, eval_metrics, train_report, test_report = train_and_evaluate_model(
        model_pretrained,
        tokenizer_class,
        model_class,
        train_texts_resampled,
        train_labels_resampled,
        test_texts,
        test_labels
    )

    # Append metrics to the results list
    model_results.append({
        "Model": model_name,
        "Train Accuracy": train_metrics.get("eval_accuracy", 0),
        "Test Accuracy": eval_metrics.get("eval_accuracy", 0),
        "Train Precision": train_metrics.get("eval_precision", 0),
        "Test Precision": eval_metrics.get("eval_precision", 0),
        "Train Recall": train_metrics.get("eval_recall", 0),
        "Test Recall": eval_metrics.get("eval_recall", 0),
        "Train F1": train_metrics.get("eval_f1", 0),
        "Test F1": eval_metrics.get("eval_f1", 0),
    })

    print(f"Finished training {model_name}.")
    print(f"Train Metrics for {model_name}: {train_metrics}")
    print(f"Validation Metrics for {model_name}: {eval_metrics}")
    print(f"\nTrain Classification Report for {model_name}:\n{train_report}")
    print(f"\nTest Classification Report for {model_name}:\n{test_report}\n")

# Convert results to a DataFrame
results_df = pd.DataFrame(model_results)

# Display summary tables sorted by Validation Accuracy and Validation F1 Score
print("Summary Table - Ordered by Validation Accuracy (Descending):")
print(results_df.sort_values(by="Test Accuracy", ascending=False).to_string(index=False))

print("\nSummary Table - Ordered by Validation F1 Score (Descending):")
print(results_df.sort_values(by="Test F1", ascending=False).to_string(index=False))
Requirement already satisfied: transformers in /usr/local/lib/python3.10/dist-packages (4.44.2)
Requirement already satisfied: datasets in /usr/local/lib/python3.10/dist-packages (3.1.0)
Requirement already satisfied: torch in /usr/local/lib/python3.10/dist-packages (2.5.0+cu121)
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.10/dist-packages (1.5.2)
Requirement already satisfied: imbalanced-learn in /usr/local/lib/python3.10/dist-packages (0.12.4)
Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from transformers) (3.16.1)
Requirement already satisfied: huggingface-hub<1.0,>=0.23.2 in /usr/local/lib/python3.10/dist-packages (from transformers) (0.24.7)
Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.10/dist-packages (from transformers) (1.26.4)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from transformers) (24.1)
Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from transformers) (6.0.2)
Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.10/dist-packages (from transformers) (2024.9.11)
Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from transformers) (2.32.3)
Requirement already satisfied: safetensors>=0.4.1 in /usr/local/lib/python3.10/dist-packages (from transformers) (0.4.5)
Requirement already satisfied: tokenizers<0.20,>=0.19 in /usr/local/lib/python3.10/dist-packages (from transformers) (0.19.1)
Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.10/dist-packages (from transformers) (4.66.6)
Requirement already satisfied: pyarrow>=15.0.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (17.0.0)
Requirement already satisfied: dill<0.3.9,>=0.3.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (0.3.8)
Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from datasets) (2.2.2)
Requirement already satisfied: xxhash in /usr/local/lib/python3.10/dist-packages (from datasets) (3.5.0)
Requirement already satisfied: multiprocess<0.70.17 in /usr/local/lib/python3.10/dist-packages (from datasets) (0.70.16)
Requirement already satisfied: fsspec<=2024.9.0,>=2023.1.0 in /usr/local/lib/python3.10/dist-packages (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets) (2024.9.0)
Requirement already satisfied: aiohttp in /usr/local/lib/python3.10/dist-packages (from datasets) (3.10.10)
Requirement already satisfied: typing-extensions>=4.8.0 in /usr/local/lib/python3.10/dist-packages (from torch) (4.12.2)
Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch) (3.4.2)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch) (3.1.4)
Requirement already satisfied: sympy==1.13.1 in /usr/local/lib/python3.10/dist-packages (from torch) (1.13.1)
Requirement already satisfied: mpmath<1.4,>=1.1.0 in /usr/local/lib/python3.10/dist-packages (from sympy==1.13.1->torch) (1.3.0)
Requirement already satisfied: scipy>=1.6.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn) (1.13.1)
Requirement already satisfied: joblib>=1.2.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn) (1.4.2)
Requirement already satisfied: threadpoolctl>=3.1.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn) (3.5.0)
Requirement already satisfied: aiohappyeyeballs>=2.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (2.4.3)
Requirement already satisfied: aiosignal>=1.1.2 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.3.1)
Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (24.2.0)
Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.5.0)
Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (6.1.0)
Requirement already satisfied: yarl<2.0,>=1.12.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.17.1)
Requirement already satisfied: async-timeout<5.0,>=4.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (4.0.3)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (3.4.0)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (3.10)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (2.2.3)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (2024.8.30)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch) (3.0.2)
Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets) (2024.2)
Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets) (2024.2)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.2->pandas->datasets) (1.16.0)
Requirement already satisfied: propcache>=0.2.0 in /usr/local/lib/python3.10/dist-packages (from yarl<2.0,>=1.12.0->aiohttp->datasets) (0.2.0)
Class distribution after oversampling: Counter({0: 253, 3: 253, 1: 253, 4: 253, 2: 253})
Training BERT...
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[1431/1590 02:22 < 00:15, 10.04 it/s, Epoch 9/10]
Epoch Training Loss Validation Loss Accuracy Precision Recall F1
1 1.049600 1.351589 0.411765 0.532632 0.411765 0.448841
2 0.265800 1.235443 0.588235 0.605933 0.588235 0.590708
3 0.042300 1.663724 0.670588 0.541629 0.670588 0.599249
4 0.001900 2.107168 0.694118 0.546618 0.694118 0.611600
5 0.211200 2.259966 0.729412 0.547059 0.729412 0.625210
6 0.012700 2.161124 0.729412 0.553650 0.729412 0.629492
7 0.033800 2.203558 0.729412 0.553650 0.729412 0.629492
8 0.005400 2.285457 0.729412 0.553650 0.729412 0.629492
9 0.057300 2.331009 0.729412 0.553650 0.729412 0.629492

Finished training BERT.
Train Metrics for BERT: {'eval_loss': 0.015103112906217575, 'eval_accuracy': 0.9944664031620554, 'eval_precision': 0.9946153846153846, 'eval_recall': 0.9944664031620554, 'eval_f1': 0.994465343943247, 'eval_runtime': 4.4745, 'eval_samples_per_second': 282.715, 'eval_steps_per_second': 35.535, 'epoch': 9.0}
Validation Metrics for BERT: {'eval_loss': 2.1611242294311523, 'eval_accuracy': 0.7294117647058823, 'eval_precision': 0.5536498936924167, 'eval_recall': 0.7294117647058823, 'eval_f1': 0.6294923448831587, 'eval_runtime': 0.3177, 'eval_samples_per_second': 267.518, 'eval_steps_per_second': 34.62, 'epoch': 9.0}

Train Classification Report for BERT:
              precision    recall  f1-score   support

           I       1.00      1.00      1.00       253
          II       1.00      0.97      0.99       253
         III       1.00      1.00      1.00       253
          IV       0.97      1.00      0.99       253
           V       1.00      1.00      1.00       253

    accuracy                           0.99      1265
   macro avg       0.99      0.99      0.99      1265
weighted avg       0.99      0.99      0.99      1265


Test Classification Report for BERT:
              precision    recall  f1-score   support

           I       0.75      0.98      0.85        63
          II       0.00      0.00      0.00         8
         III       0.00      0.00      0.00         6
          IV       0.00      0.00      0.00         6
           V       0.00      0.00      0.00         2

    accuracy                           0.73        85
   macro avg       0.15      0.20      0.17        85
weighted avg       0.55      0.73      0.63        85


Training RoBERTa...
tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]
vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]
merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]
tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]
config.json:   0%|          | 0.00/481 [00:00<?, ?B/s]
model.safetensors:   0%|          | 0.00/499M [00:00<?, ?B/s]
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[1431/1590 02:37 < 00:17, 9.10 it/s, Epoch 9/10]
Epoch Training Loss Validation Loss Accuracy Precision Recall F1
1 0.994800 1.322501 0.341176 0.632277 0.341176 0.409500
2 0.152200 2.263594 0.364706 0.563000 0.364706 0.422125
3 0.114200 2.062973 0.705882 0.577066 0.705882 0.631593
4 0.001900 2.463611 0.600000 0.584977 0.600000 0.592008
5 0.152900 2.354644 0.729412 0.553650 0.729412 0.629492
6 0.004900 2.478804 0.729412 0.586657 0.729412 0.641258
7 0.135400 2.564391 0.729412 0.586657 0.729412 0.641258
8 0.009500 2.641434 0.705882 0.570147 0.705882 0.627286
9 0.049600 2.637519 0.705882 0.570147 0.705882 0.627286

Finished training RoBERTa.
Train Metrics for RoBERTa: {'eval_loss': 0.011559602804481983, 'eval_accuracy': 0.9944664031620554, 'eval_precision': 0.9946153846153846, 'eval_recall': 0.9944664031620554, 'eval_f1': 0.994465343943247, 'eval_runtime': 4.4827, 'eval_samples_per_second': 282.194, 'eval_steps_per_second': 35.469, 'epoch': 9.0}
Validation Metrics for RoBERTa: {'eval_loss': 2.478804111480713, 'eval_accuracy': 0.7294117647058823, 'eval_precision': 0.5866571018651363, 'eval_recall': 0.7294117647058823, 'eval_f1': 0.6412576064908723, 'eval_runtime': 0.3411, 'eval_samples_per_second': 249.185, 'eval_steps_per_second': 32.248, 'epoch': 9.0}

Train Classification Report for RoBERTa:
              precision    recall  f1-score   support

           I       1.00      1.00      1.00       253
          II       1.00      0.97      0.99       253
         III       1.00      1.00      1.00       253
          IV       0.97      1.00      0.99       253
           V       1.00      1.00      1.00       253

    accuracy                           0.99      1265
   macro avg       0.99      0.99      0.99      1265
weighted avg       0.99      0.99      0.99      1265


Test Classification Report for RoBERTa:
              precision    recall  f1-score   support

           I       0.74      0.97      0.84        63
          II       0.00      0.00      0.00         8
         III       0.50      0.17      0.25         6
          IV       0.00      0.00      0.00         6
           V       0.00      0.00      0.00         2

    accuracy                           0.73        85
   macro avg       0.25      0.23      0.22        85
weighted avg       0.59      0.73      0.64        85


Training DistilBERT...
tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]
vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]
tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]
config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]
model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[1272/1590 01:20 < 00:20, 15.87 it/s, Epoch 8/10]
Epoch Training Loss Validation Loss Accuracy Precision Recall F1
1 1.168700 1.407919 0.305882 0.620078 0.305882 0.329780
2 0.136600 1.395260 0.576471 0.578972 0.576471 0.577179
3 0.005100 2.059941 0.694118 0.546618 0.694118 0.611600
4 0.001900 2.491858 0.670588 0.555882 0.670588 0.607871
5 0.037000 2.584239 0.694118 0.553537 0.694118 0.615907
6 0.019300 2.640933 0.694118 0.553537 0.694118 0.615907
7 0.024000 2.694793 0.694118 0.553537 0.694118 0.615907
8 0.004800 2.730828 0.694118 0.553537 0.694118 0.615907

Finished training DistilBERT.
Train Metrics for DistilBERT: {'eval_loss': 0.00810485240072012, 'eval_accuracy': 0.9944664031620554, 'eval_precision': 0.9946153846153846, 'eval_recall': 0.9944664031620554, 'eval_f1': 0.994465343943247, 'eval_runtime': 2.6244, 'eval_samples_per_second': 482.01, 'eval_steps_per_second': 60.585, 'epoch': 8.0}
Validation Metrics for DistilBERT: {'eval_loss': 2.5842387676239014, 'eval_accuracy': 0.6941176470588235, 'eval_precision': 0.5535368577810871, 'eval_recall': 0.6941176470588235, 'eval_f1': 0.615907207953604, 'eval_runtime': 0.1921, 'eval_samples_per_second': 442.489, 'eval_steps_per_second': 57.263, 'epoch': 8.0}

Train Classification Report for DistilBERT:
              precision    recall  f1-score   support

           I       1.00      1.00      1.00       253
          II       1.00      0.97      0.99       253
         III       1.00      1.00      1.00       253
          IV       0.97      1.00      0.99       253
           V       1.00      1.00      1.00       253

    accuracy                           0.99      1265
   macro avg       0.99      0.99      0.99      1265
weighted avg       0.99      0.99      0.99      1265


Test Classification Report for DistilBERT:
              precision    recall  f1-score   support

           I       0.75      0.94      0.83        63
          II       0.00      0.00      0.00         8
         III       0.00      0.00      0.00         6
          IV       0.00      0.00      0.00         6
           V       0.00      0.00      0.00         2

    accuracy                           0.69        85
   macro avg       0.15      0.19      0.17        85
weighted avg       0.55      0.69      0.62        85


Training XLNet...
spiece.model:   0%|          | 0.00/798k [00:00<?, ?B/s]
tokenizer.json:   0%|          | 0.00/1.38M [00:00<?, ?B/s]
config.json:   0%|          | 0.00/760 [00:00<?, ?B/s]
pytorch_model.bin:   0%|          | 0.00/467M [00:00<?, ?B/s]
Some weights of XLNetForSequenceClassification were not initialized from the model checkpoint at xlnet-base-cased and are newly initialized: ['logits_proj.bias', 'logits_proj.weight', 'sequence_summary.summary.bias', 'sequence_summary.summary.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[1431/1590 03:16 < 00:21, 7.27 it/s, Epoch 9/10]
Epoch Training Loss Validation Loss Accuracy Precision Recall F1
1 0.836000 1.111737 0.647059 0.614902 0.647059 0.629265
2 0.221800 1.628147 0.505882 0.605473 0.505882 0.546473
3 0.046000 2.495870 0.729412 0.553650 0.729412 0.629492
4 0.002900 4.445165 0.435294 0.640326 0.435294 0.499970
5 0.070200 2.492741 0.717647 0.565147 0.717647 0.632332
6 0.020000 2.966770 0.705882 0.590118 0.705882 0.642583
7 0.029800 2.890443 0.694118 0.575387 0.694118 0.629200
8 0.001900 3.000550 0.705882 0.542324 0.705882 0.613387
9 0.062600 2.985967 0.705882 0.549020 0.705882 0.617647

Finished training XLNet.
Train Metrics for XLNet: {'eval_loss': 0.007476487662643194, 'eval_accuracy': 0.9952569169960475, 'eval_precision': 0.9953667953667954, 'eval_recall': 0.9952569169960475, 'eval_f1': 0.99525625, 'eval_runtime': 6.6545, 'eval_samples_per_second': 190.096, 'eval_steps_per_second': 23.894, 'epoch': 9.0}
Validation Metrics for XLNet: {'eval_loss': 2.9667699337005615, 'eval_accuracy': 0.7058823529411765, 'eval_precision': 0.5901176470588235, 'eval_recall': 0.7058823529411765, 'eval_f1': 0.6425831202046036, 'eval_runtime': 0.4559, 'eval_samples_per_second': 186.429, 'eval_steps_per_second': 24.126, 'epoch': 9.0}

Train Classification Report for XLNet:
              precision    recall  f1-score   support

           I       1.00      1.00      1.00       253
          II       0.98      1.00      0.99       253
         III       1.00      1.00      1.00       253
          IV       1.00      0.98      0.99       253
           V       1.00      1.00      1.00       253

    accuracy                           1.00      1265
   macro avg       1.00      1.00      1.00      1265
weighted avg       1.00      1.00      1.00      1265


Test Classification Report for XLNet:
              precision    recall  f1-score   support

           I       0.79      0.94      0.86        63
          II       0.00      0.00      0.00         8
         III       0.10      0.17      0.12         6
          IV       0.00      0.00      0.00         6
           V       0.00      0.00      0.00         2

    accuracy                           0.71        85
   macro avg       0.18      0.22      0.20        85
weighted avg       0.59      0.71      0.64        85


Training ALBERT...
tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]
spiece.model:   0%|          | 0.00/760k [00:00<?, ?B/s]
tokenizer.json:   0%|          | 0.00/1.31M [00:00<?, ?B/s]
config.json:   0%|          | 0.00/684 [00:00<?, ?B/s]
model.safetensors:   0%|          | 0.00/47.4M [00:00<?, ?B/s]
Some weights of AlbertForSequenceClassification were not initialized from the model checkpoint at albert-base-v2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[1590/1590 02:30, Epoch 10/10]
Epoch Training Loss Validation Loss Accuracy Precision Recall F1
1 0.927300 1.255324 0.376471 0.551288 0.376471 0.440584
2 0.229400 1.119273 0.588235 0.561497 0.588235 0.574555
3 0.035100 1.588666 0.576471 0.631529 0.576471 0.591244
4 0.932200 1.172791 0.576471 0.564706 0.576471 0.568627
5 0.461700 1.790805 0.623529 0.549556 0.623529 0.583798
6 0.032300 2.113769 0.717647 0.551363 0.717647 0.623611
7 0.081500 2.533544 0.729412 0.547059 0.729412 0.625210
8 0.005200 2.596861 0.729412 0.547059 0.729412 0.625210
9 0.052000 2.654782 0.729412 0.547059 0.729412 0.625210
10 0.000300 2.662885 0.729412 0.547059 0.729412 0.625210

Finished training ALBERT.
Train Metrics for ALBERT: {'eval_loss': 0.013694602064788342, 'eval_accuracy': 0.9936758893280633, 'eval_precision': 0.9938056680161942, 'eval_recall': 0.9936758893280633, 'eval_f1': 0.9936749155617316, 'eval_runtime': 5.2624, 'eval_samples_per_second': 240.383, 'eval_steps_per_second': 30.214, 'epoch': 10.0}
Validation Metrics for ALBERT: {'eval_loss': 2.533543825149536, 'eval_accuracy': 0.7294117647058823, 'eval_precision': 0.5470588235294118, 'eval_recall': 0.7294117647058823, 'eval_f1': 0.6252100840336134, 'eval_runtime': 0.3688, 'eval_samples_per_second': 230.485, 'eval_steps_per_second': 29.827, 'epoch': 10.0}

Train Classification Report for ALBERT:
              precision    recall  f1-score   support

           I       1.00      1.00      1.00       253
          II       1.00      0.97      0.98       253
         III       1.00      1.00      1.00       253
          IV       0.97      1.00      0.99       253
           V       1.00      1.00      1.00       253

    accuracy                           0.99      1265
   macro avg       0.99      0.99      0.99      1265
weighted avg       0.99      0.99      0.99      1265


Test Classification Report for ALBERT:
              precision    recall  f1-score   support

           I       0.74      0.98      0.84        63
          II       0.00      0.00      0.00         8
         III       0.00      0.00      0.00         6
          IV       0.00      0.00      0.00         6
           V       0.00      0.00      0.00         2

    accuracy                           0.73        85
   macro avg       0.15      0.20      0.17        85
weighted avg       0.55      0.73      0.63        85


Summary Table - Ordered by Validation Accuracy (Descending):
     Model  Train Accuracy  Test Accuracy  Train Precision  Test Precision  Train Recall  Test Recall  Train F1  Test F1
      BERT        0.994466       0.729412         0.994615        0.553650      0.994466     0.729412  0.994465 0.629492
   RoBERTa        0.994466       0.729412         0.994615        0.586657      0.994466     0.729412  0.994465 0.641258
    ALBERT        0.993676       0.729412         0.993806        0.547059      0.993676     0.729412  0.993675 0.625210
     XLNet        0.995257       0.705882         0.995367        0.590118      0.995257     0.705882  0.995256 0.642583
DistilBERT        0.994466       0.694118         0.994615        0.553537      0.994466     0.694118  0.994465 0.615907

Summary Table - Ordered by Validation F1 Score (Descending):
     Model  Train Accuracy  Test Accuracy  Train Precision  Test Precision  Train Recall  Test Recall  Train F1  Test F1
     XLNet        0.995257       0.705882         0.995367        0.590118      0.995257     0.705882  0.995256 0.642583
   RoBERTa        0.994466       0.729412         0.994615        0.586657      0.994466     0.729412  0.994465 0.641258
      BERT        0.994466       0.729412         0.994615        0.553650      0.994466     0.729412  0.994465 0.629492
    ALBERT        0.993676       0.729412         0.993806        0.547059      0.993676     0.729412  0.993675 0.625210
DistilBERT        0.994466       0.694118         0.994615        0.553537      0.994466     0.694118  0.994465 0.615907
Insights:¶
  • BERT is performing best in terms of Test accuracy and XLNET is performing best in terms of test F1 score.
  • None of the models show improvement over the previous run of models.
  • The performance of the model degraded after over sampling the data. The Best test accuracy decreased from 80% to 72% and best F1 score decreased from 71% to 64%.

Here's a detailed breakdown of each step in the code, especially focusing on why and how RandomOverSampler is used to handle data imbalance and how it aims to improve the model's performance.

Step-by-Step Explanation¶
  1. Installing Required Libraries:

    !pip install transformers datasets torch scikit-learn imbalanced-learn
    

    Required libraries are installed for deep learning (using transformers for model handling), data manipulation (datasets, pandas, and torch), evaluation (scikit-learn), and handling imbalanced datasets (imbalanced-learn).

  2. Imports: Libraries and classes for model tokenizers, encoders, sampling, and evaluation metrics are imported, alongside warnings to suppress non-essential warnings.

  3. Data Loading and Preprocessing:

    df = data_original.copy()
    df = df[['Description', 'Accident Level']]
    

    The data is loaded, selecting only the columns relevant to text descriptions and accident severity levels.

  4. Label Encoding:

    label_encoder = LabelEncoder()
    df['labels'] = label_encoder.fit_transform(df['Accident Level'])
    

    LabelEncoder converts the accident severity levels from categorical values into numeric labels, which are required for model training.

  5. Train-Test Split with Stratification:

    train_texts, test_texts, train_labels, test_labels = train_test_split(
        df['Description'].tolist(),
        df['labels'].tolist(),
        test_size=0.2,
        random_state=42,
        stratify=df['labels']
    )
    

    Here, the data is split into training and testing sets while maintaining the original label distribution proportionally using stratify. This ensures that each class is represented similarly in both training and test sets.

  6. Using RandomOverSampler for Balancing:

    ros = RandomOverSampler(random_state=42)
    train_texts_df = pd.DataFrame(train_texts, columns=['Description'])
    train_texts_resampled, train_labels_resampled = ros.fit_resample(train_texts_df, train_labels)
    train_texts_resampled = train_texts_resampled['Description'].tolist()
    
    • Purpose of RandomOverSampler: RandomOverSampler is used to balance the dataset by randomly replicating samples from minority classes until each class has the same number of samples as the majority class. This approach aims to improve the model's ability to learn equally well for all classes by providing enough examples from each class.
    • Why RandomOverSampler? Compared to techniques like SMOTE (Synthetic Minority Over-sampling Technique), which creates synthetic samples, random oversampling avoids introducing potentially noisy or unrealistic synthetic samples by simply duplicating real instances from the minority classes. This simplicity often works well for text-based data, where synthetically generating new sentences can be challenging without altering semantics.
    • Expected Benefit: Without balancing, a model trained on imbalanced data tends to be biased toward predicting the majority class. By balancing with oversampling, each class has an equal presence, encouraging the model to focus on learning features that discriminate all classes effectively. This can lead to a more balanced performance, particularly in metrics like F1 score, which considers both precision and recall.
  7. Metrics Calculation Function:

    def compute_metrics(pred):
        labels = pred.label_ids
        preds = pred.predictions.argmax(-1)
        accuracy = accuracy_score(labels, preds)
        precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='weighted')
        return {
            "accuracy": accuracy,
            "precision": precision,
            "recall": recall,
            "f1": f1
        }
    

    This function computes the performance metrics: accuracy, precision, recall, and F1 score (weighted for imbalanced classes). F1 score is particularly relevant here as it balances precision and recall, making it a good metric for imbalanced data.

  8. Model Training and Evaluation: The function train_and_evaluate_model handles model training and evaluation:

    • Tokenizer and Model Initialization: The model and tokenizer are loaded from pre-trained configurations.

    • Tokenization: The text is tokenized, with sequences truncated or padded to a max length of 512 tokens.

    • Dataset Preparation: Tokenized data is converted into a Hugging Face Dataset for both train and test sets.

    • Training Arguments:

      training_args = TrainingArguments(
          output_dir='./results',
          num_train_epochs=10,
          per_device_train_batch_size=8,
          per_device_eval_batch_size=8,
          evaluation_strategy="epoch",
          save_strategy="epoch",
          load_best_model_at_end=True,
          metric_for_best_model="f1",
          greater_is_better=True,
          warmup_steps=500,
          weight_decay=0.01,
          logging_dir='./logs',
          logging_steps=10,
          report_to="none"
      )
      
      • evaluation_strategy="epoch": Evaluation is done at the end of each epoch.
      • metric_for_best_model="f1": The model selects the best checkpoint based on the highest F1 score, making it suitable for imbalanced data.
      • early_stopping_patience=3 ensures training stops early if the model stops improving after 3 epochs, preventing overfitting on the resampled data.
    • Training and Evaluation with Trainer: The model is trained using Trainer, with metrics calculated and stored after training and evaluation on the train and test datasets.

  9. Model Loop and Result Storage: Each pre-defined model is trained, evaluated, and results are collected in a list for later analysis.

  10. Result Summarization and Display: Finally, metrics are summarized in a DataFrame, with results displayed for accuracy and F1 score, sorted to show the models with the best performance.

RandomOverSampler's Role in Model Performance¶

RandomOverSampler is a specific type of resampling focused on oversampling by duplicating samples from minority classes without generating synthetic data. It randomly duplicates existing samples until the minority classes reach the same size as the majority class. RandomOverSampler is crucial in mitigating class imbalance by ensuring that the minority classes are sufficiently represented. This can lead to:

  • Improved Recall for minority classes, as the model is exposed to a balanced number of examples from each class.
  • Increased F1 Score due to a better balance between precision and recall across all classes.
  • Reduction of Majority Class Bias, leading to fairer and more reliable predictions across the severity levels in the dataset.
  • Simple and Fast: Since it duplicates actual data points rather than creating synthetic ones, it is computationally faster and straightforward to implement.

Limitations:

  • Increased Overfitting Risk: Duplicating minority-class samples can lead to overfitting, especially if the model memorizes duplicated samples rather than learning generalizable features. This is particularly a concern with smaller datasets.
  • Lack of Diversity in Data: RandomOverSampler does not add any new or diverse information to the dataset. In contrast, methods like SMOTE introduce some variation by creating synthetic samples, which can sometimes help the model generalize better.

Modifying the data to get better results¶

  • Here we will merge few similar classes to overcome the performance issue caused by a huge imbalance in the dataset.
  • Balancing of the dataset is also not giving better performance
  • The folliwing class modification will be done:
    • Accident level 1 will be mapped to Low accident severity
    • Accident level 2 and 3 will be mapped to Medium accident severity
    • Accident level 4 and 5 will be mapped to High accident severity
In [ ]:
# Load data
df = data_original.copy()
df = df[['Description', 'Accident Level']]  # Columns with text and labels

# Calculate value counts and percentages for the 'Accident Level' column in original dataset
value_counts = df['Accident Level'].value_counts()
percentages = df['Accident Level'].value_counts(normalize=True) * 100

# Plot the bar chart
plt.figure(figsize=(10, 6))
value_counts.plot(kind='bar', color='skyblue')
plt.title('Value Counts and Percentages for Accident Level Column')
plt.ylabel('Count')
plt.xlabel('Accident Level')

# Show the counts and percentages on top of the bars
for i, (count, pct) in enumerate(zip(value_counts, percentages)):
    plt.text(i, count + 0.5, f"{count} ({pct:.2f}%)", ha='center', fontweight='bold')

plt.show()
No description has been provided for this image

Insights:

  • There is a huge imbalance in the data, therefore we will try to merge the classes and get more reliable results.
In [ ]:
df.head(2)
Out[ ]:
Description Accident Level labels
0 While removing the drill rod of the Jumbo 08 f... I 0
1 During the activation of a sodium sulphide pum... I 0
In [ ]:
# Map each accident level to the corresponding severity level
def map_severity(level):
    if level == 'I':
        return 'Low'
    elif level in ['II', 'III']:
        return 'Medium'
    elif level in ['IV', 'V']:
        return 'High'

# Apply the mapping function to create a new column
df['Accident Severity'] = df['Accident Level'].apply(map_severity)

# Display the updated DataFrame
df.head()
Out[ ]:
Description Accident Level Accident Severity
0 While removing the drill rod of the Jumbo 08 f... I Low
1 During the activation of a sodium sulphide pum... I Low
2 In the sub-station MILPO located at level +170... I Low
3 Being 9:45 am. approximately in the Nv. 1880 C... I Low
4 Approximately at 11:45 a.m. in circumstances t... IV High
In [ ]:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
# Encode labels
label_encoder = LabelEncoder()
df['labels'] = label_encoder.fit_transform(df['Accident Severity'])

# Split data into train and test
train_texts, test_texts, train_labels, test_labels = train_test_split(
    df['Description'].tolist(), df['labels'].tolist(), test_size=0.2, random_state=42
)

df.head()
Out[ ]:
Description Accident Level Accident Severity labels
0 While removing the drill rod of the Jumbo 08 f... I Low 1
1 During the activation of a sodium sulphide pum... I Low 1
2 In the sub-station MILPO located at level +170... I Low 1
3 Being 9:45 am. approximately in the Nv. 1880 C... I Low 1
4 Approximately at 11:45 a.m. in circumstances t... IV High 0
In [ ]:
# Calculate value counts and percentages for the 'Accident Level' column in the dataset containing the merged accident class levels.
value_counts = df['Accident Severity'].value_counts()
percentages = df['Accident Severity'].value_counts(normalize=True) * 100

# Plot the bar chart
plt.figure(figsize=(10, 6))
value_counts.plot(kind='bar', color='skyblue')
plt.title('Value Counts and Percentages for Accident Severity Column')
plt.ylabel('Count')
plt.xlabel('Accident Severity')

# Show the counts and percentages on top of the bars
for i, (count, pct) in enumerate(zip(value_counts, percentages)):
    plt.text(i, count + 0.5, f"{count} ({pct:.2f}%)", ha='center', fontweight='bold')

plt.show()
No description has been provided for this image

Insights:

  • After merging the accident levels we get comparatilvely less imbalanced Accident Severity column.
  • We are expecting to get better performance after merging the classes.
In [ ]:
from transformers import TrainingArguments, Trainer
from datasets import Dataset
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report
import pandas as pd
import warnings
import torch

warnings.filterwarnings("ignore")

# Function to compute metrics
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    accuracy = accuracy_score(labels, preds)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='weighted')
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1
    }

# Function to generate a classification report
def get_classification_report(labels, preds):
    return classification_report(labels, preds, output_dict=True)

# Training function for each model
def train_and_evaluate_model(model_pretrained, tokenizer_class, model_class, train_texts, train_labels, test_texts, test_labels):
    tokenizer = tokenizer_class.from_pretrained(model_pretrained)
    model = model_class.from_pretrained(model_pretrained, num_labels=len(set(train_labels)))

    # Ensure all model parameters are contiguous
    for param in model.parameters():
        if not param.is_contiguous():
            param.data = param.data.contiguous()

    # Tokenize the datasets
    train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=512)
    test_encodings = tokenizer(test_texts, truncation=True, padding=True, max_length=512)

    # Convert tokenized data to Dataset format
    train_dataset = Dataset.from_dict({
        "input_ids": train_encodings["input_ids"],
        "attention_mask": train_encodings["attention_mask"],
        "labels": train_labels
    })
    test_dataset = Dataset.from_dict({
        "input_ids": test_encodings["input_ids"],
        "attention_mask": test_encodings["attention_mask"],
        "labels": test_labels
    })

    # Define training arguments
    training_args = TrainingArguments(
        output_dir='./results',
        num_train_epochs=10,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        warmup_steps=500,
        weight_decay=0.01,
        logging_dir='./logs',
        logging_steps=10,
        evaluation_strategy="epoch",
        load_best_model_at_end=True,
        save_strategy="epoch",
        metric_for_best_model="f1",
        greater_is_better=True,
    )

    # Define Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=test_dataset,
        compute_metrics=compute_metrics
    )

    # Train and evaluate
    trainer.train()

    # Get final predictions and generate classification reports
    train_preds = trainer.predict(train_dataset)
    test_preds = trainer.predict(test_dataset)

    train_report = get_classification_report(train_preds.label_ids, train_preds.predictions.argmax(-1))
    test_report = get_classification_report(test_preds.label_ids, test_preds.predictions.argmax(-1))

    # Get accuracy and f1 metrics for summary table
    train_accuracy = train_report["accuracy"]
    test_accuracy = test_report["accuracy"]
    train_f1 = train_report["weighted avg"]["f1-score"]
    test_f1 = test_report["weighted avg"]["f1-score"]

    return train_report, test_report, train_accuracy, test_accuracy, train_f1, test_f1

# Define models and tokenizers
models = {
    "BERT": (BertTokenizer, BertForSequenceClassification, 'bert-base-uncased'),
    "RoBERTa": (RobertaTokenizer, RobertaForSequenceClassification, 'roberta-base'),
    "DistilBERT": (DistilBertTokenizer, DistilBertForSequenceClassification, 'distilbert-base-uncased'),
    "XLNet": (XLNetTokenizer, XLNetForSequenceClassification, 'xlnet-base-cased'),
    "ALBERT": (AlbertTokenizer, AlbertForSequenceClassification, 'albert-base-v2')
}

# Initialize a list to store results for each model
model_results = []

# Loop through models
for model_name, (tokenizer_class, model_class, model_pretrained) in models.items():
    print(f"Training {model_name}...")
    train_report, test_report, train_accuracy, test_accuracy, train_f1, test_f1 = train_and_evaluate_model(
        model_pretrained, tokenizer_class, model_class, train_texts, train_labels, test_texts, test_labels
    )

    # Append the best epoch's classification report and metrics for each model to model_results list
    model_results.append({
        "Model": model_name,
        "Train Accuracy": train_accuracy,
        "Test Accuracy": test_accuracy,
        "Train F1": train_f1,
        "Test F1": test_f1,
        "Train Classification Report": train_report,
        "Test Classification Report": test_report
    })

    print(f"Finished training {model_name}.\n")
    print(f"Train Classification Report for {model_name}:\n", pd.DataFrame(train_report).T)
    print(f"Test Classification Report for {model_name}:\n", pd.DataFrame(test_report).T)

# Convert results to DataFrame for summary table
summary_df = pd.DataFrame(model_results)[["Model", "Train Accuracy", "Test Accuracy", "Train F1", "Test F1"]]

# Display summary tables ordered by Test Accuracy and Test F1
print("\nSummary Table - Ordered by Test Accuracy (Descending):")
print(summary_df.sort_values(by="Test Accuracy", ascending=False).to_string(index=False))

print("\nSummary Table - Ordered by Test F1 Score (Descending):")
print(summary_df.sort_values(by="Test F1", ascending=False).to_string(index=False))
Training BERT...
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[430/430 01:11, Epoch 10/10]
Epoch Training Loss Validation Loss Accuracy Precision Recall F1
1 0.855400 0.743329 0.800000 0.640000 0.800000 0.711111
2 0.626700 0.626402 0.800000 0.640000 0.800000 0.711111
3 0.734900 0.622588 0.800000 0.640000 0.800000 0.711111
4 0.637000 0.589013 0.800000 0.640000 0.800000 0.711111
5 0.541500 0.606670 0.800000 0.681077 0.800000 0.727581
6 0.483200 0.707722 0.682353 0.670844 0.682353 0.676521
7 0.106100 0.851349 0.811765 0.697023 0.811765 0.747698
8 0.143000 1.069056 0.800000 0.724706 0.800000 0.748023
9 0.278100 1.557687 0.600000 0.684130 0.600000 0.636531
10 0.143500 1.821103 0.647059 0.688259 0.647059 0.666051

Finished training BERT.

Train Classification Report for BERT:
               precision    recall  f1-score     support
0              0.966667  0.906250  0.935484   32.000000
1              0.980237  1.000000  0.990020  248.000000
2              0.964912  0.916667  0.940171   60.000000
accuracy       0.976471  0.976471  0.976471    0.976471
macro avg      0.970605  0.940972  0.955225  340.000000
weighted avg   0.976256  0.976471  0.976090  340.000000
Test Classification Report for BERT:
               precision    recall  f1-score  support
0              0.000000  0.000000  0.000000      6.0
1              0.825000  0.970588  0.891892     68.0
2              0.500000  0.181818  0.266667     11.0
accuracy       0.800000  0.800000  0.800000      0.8
macro avg      0.441667  0.384135  0.386186     85.0
weighted avg   0.724706  0.800000  0.748023     85.0
Training RoBERTa...
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[430/430 01:19, Epoch 10/10]
Epoch Training Loss Validation Loss Accuracy Precision Recall F1
1 1.118200 1.047989 0.800000 0.640000 0.800000 0.711111
2 0.630900 0.653031 0.800000 0.640000 0.800000 0.711111
3 0.767500 0.606745 0.800000 0.640000 0.800000 0.711111
4 0.686900 0.580249 0.800000 0.640000 0.800000 0.711111
5 0.639300 0.627484 0.800000 0.710489 0.800000 0.729843
6 0.901200 0.657368 0.705882 0.680112 0.705882 0.692373
7 0.321200 0.814894 0.764706 0.701604 0.764706 0.726545
8 0.292000 1.048538 0.741176 0.718972 0.741176 0.725165
9 0.219300 1.454005 0.658824 0.727413 0.658824 0.681675
10 0.234600 1.804222 0.717647 0.739312 0.717647 0.725391

Finished training RoBERTa.

Train Classification Report for RoBERTa:
               precision    recall  f1-score     support
0              1.000000  0.031250  0.060606   32.000000
1              0.814570  0.991935  0.894545  248.000000
2              0.702703  0.433333  0.536082   60.000000
accuracy       0.802941  0.802941  0.802941    0.802941
macro avg      0.839091  0.485506  0.497078  340.000000
weighted avg   0.812281  0.802941  0.752799  340.000000
Test Classification Report for RoBERTa:
               precision    recall  f1-score  support
0              0.000000  0.000000  0.000000      6.0
1              0.807229  0.985294  0.887417     68.0
2              0.500000  0.090909  0.153846     11.0
accuracy       0.800000  0.800000  0.800000      0.8
macro avg      0.435743  0.358734  0.347088     85.0
weighted avg   0.710489  0.800000  0.729843     85.0
Training DistilBERT...
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[430/430 00:46, Epoch 10/10]
Epoch Training Loss Validation Loss Accuracy Precision Recall F1
1 1.014300 0.911074 0.800000 0.640000 0.800000 0.711111
2 0.649000 0.631885 0.800000 0.640000 0.800000 0.711111
3 0.714900 0.621069 0.800000 0.640000 0.800000 0.711111
4 0.632800 0.616022 0.800000 0.640000 0.800000 0.711111
5 0.593700 0.612611 0.788235 0.638095 0.788235 0.705263
6 0.511200 0.697658 0.752941 0.690092 0.752941 0.719589
7 0.280900 0.857517 0.800000 0.640000 0.800000 0.711111
8 0.185300 0.941398 0.729412 0.708906 0.729412 0.718146
9 0.293300 1.426194 0.611765 0.718460 0.611765 0.652975
10 0.102400 1.674249 0.635294 0.754812 0.635294 0.675424

Finished training DistilBERT.

Train Classification Report for DistilBERT:
               precision    recall  f1-score     support
0              0.909091  0.625000  0.740741   32.000000
1              0.953668  0.995968  0.974359  248.000000
2              0.830508  0.816667  0.823529   60.000000
accuracy       0.929412  0.929412  0.929412    0.929412
macro avg      0.897756  0.812545  0.846210  340.000000
weighted avg   0.927738  0.929412  0.925754  340.000000
Test Classification Report for DistilBERT:
               precision    recall  f1-score    support
0              0.000000  0.000000  0.000000   6.000000
1              0.826667  0.911765  0.867133  68.000000
2              0.222222  0.181818  0.200000  11.000000
accuracy       0.752941  0.752941  0.752941   0.752941
macro avg      0.349630  0.364528  0.355711  85.000000
weighted avg   0.690092  0.752941  0.719589  85.000000
Training XLNet...
Some weights of XLNetForSequenceClassification were not initialized from the model checkpoint at xlnet-base-cased and are newly initialized: ['logits_proj.bias', 'logits_proj.weight', 'sequence_summary.summary.bias', 'sequence_summary.summary.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[430/430 01:30, Epoch 10/10]
Epoch Training Loss Validation Loss Accuracy Precision Recall F1
1 0.814900 0.656850 0.800000 0.640000 0.800000 0.711111
2 0.641800 0.643237 0.800000 0.640000 0.800000 0.711111
3 0.702500 0.594788 0.800000 0.640000 0.800000 0.711111
4 0.561700 0.595120 0.800000 0.640000 0.800000 0.711111
5 0.535400 0.653683 0.752941 0.741345 0.752941 0.746701
6 0.474500 0.739600 0.729412 0.704817 0.729412 0.715716
7 0.470300 1.180166 0.788235 0.645783 0.788235 0.709934
8 0.326400 0.922878 0.694118 0.744538 0.694118 0.718366
9 0.397200 1.917955 0.541176 0.777873 0.541176 0.596703
10 0.158000 1.593332 0.741176 0.719244 0.741176 0.728689

Finished training XLNet.

Train Classification Report for XLNet:
               precision    recall  f1-score     support
0              0.812500  0.812500  0.812500   32.000000
1              0.959677  0.959677  0.959677  248.000000
2              0.833333  0.833333  0.833333   60.000000
accuracy       0.923529  0.923529  0.923529    0.923529
macro avg      0.868504  0.868504  0.868504  340.000000
weighted avg   0.923529  0.923529  0.923529  340.000000
Test Classification Report for XLNet:
               precision    recall  f1-score    support
0              0.400000  0.333333  0.363636   6.000000
1              0.842857  0.867647  0.855072  68.000000
2              0.300000  0.272727  0.285714  11.000000
accuracy       0.752941  0.752941  0.752941   0.752941
macro avg      0.514286  0.491236  0.501474  85.000000
weighted avg   0.741345  0.752941  0.746701  85.000000
Training ALBERT...
Some weights of AlbertForSequenceClassification were not initialized from the model checkpoint at albert-base-v2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[430/430 00:47, Epoch 10/10]
Epoch Training Loss Validation Loss Accuracy Precision Recall F1
1 0.774600 0.641963 0.800000 0.640000 0.800000 0.711111
2 0.636300 0.617865 0.800000 0.640000 0.800000 0.711111
3 0.721200 0.623228 0.800000 0.640000 0.800000 0.711111
4 0.663100 0.615305 0.800000 0.640000 0.800000 0.711111
5 0.648300 0.780092 0.658824 0.703536 0.658824 0.674541
6 0.625700 0.570941 0.788235 0.638095 0.788235 0.705263
7 0.353000 0.757557 0.788235 0.638095 0.788235 0.705263
8 0.420200 0.759293 0.658824 0.760322 0.658824 0.699141
9 0.342400 0.738538 0.729412 0.695443 0.729412 0.709479
10 0.359900 1.180593 0.788235 0.638095 0.788235 0.705263

Finished training ALBERT.

Train Classification Report for ALBERT:
               precision    recall  f1-score     support
0              0.000000  0.000000  0.000000   32.000000
1              0.729412  1.000000  0.843537  248.000000
2              0.000000  0.000000  0.000000   60.000000
accuracy       0.729412  0.729412  0.729412    0.729412
macro avg      0.243137  0.333333  0.281179  340.000000
weighted avg   0.532042  0.729412  0.615286  340.000000
Test Classification Report for ALBERT:
               precision    recall  f1-score  support
0              0.000000  0.000000  0.000000      6.0
1              0.800000  1.000000  0.888889     68.0
2              0.000000  0.000000  0.000000     11.0
accuracy       0.800000  0.800000  0.800000      0.8
macro avg      0.266667  0.333333  0.296296     85.0
weighted avg   0.640000  0.800000  0.711111     85.0

Summary Table - Ordered by Test Accuracy (Descending):
     Model  Train Accuracy  Test Accuracy  Train F1  Test F1
      BERT        0.976471       0.800000  0.976090 0.748023
   RoBERTa        0.802941       0.800000  0.752799 0.729843
    ALBERT        0.729412       0.800000  0.615286 0.711111
DistilBERT        0.929412       0.752941  0.925754 0.719589
     XLNet        0.923529       0.752941  0.923529 0.746701

Summary Table - Ordered by Test F1 Score (Descending):
     Model  Train Accuracy  Test Accuracy  Train F1  Test F1
      BERT        0.976471       0.800000  0.976090 0.748023
     XLNet        0.923529       0.752941  0.923529 0.746701
   RoBERTa        0.802941       0.800000  0.752799 0.729843
DistilBERT        0.929412       0.752941  0.925754 0.719589
    ALBERT        0.729412       0.800000  0.615286 0.711111
In [ ]:
# Performance metrics data sorted in descending order of test accuracy
summary_df.sort_values(by="Test Accuracy", ascending=False)
Out[ ]:
Model Train Accuracy Test Accuracy Train F1 Test F1
0 BERT 0.976471 0.800000 0.976090 0.748023
1 RoBERTa 0.802941 0.800000 0.752799 0.729843
4 ALBERT 0.729412 0.800000 0.615286 0.711111
2 DistilBERT 0.929412 0.752941 0.925754 0.719589
3 XLNet 0.923529 0.752941 0.923529 0.746701
In [ ]:
# Performance metrics data sorted in descending order of test F1 score
summary_df.sort_values(by="Test F1", ascending=False)
Out[ ]:
Model Train Accuracy Test Accuracy Train F1 Test F1
0 BERT 0.976471 0.800000 0.976090 0.748023
3 XLNet 0.923529 0.752941 0.923529 0.746701
1 RoBERTa 0.802941 0.800000 0.752799 0.729843
2 DistilBERT 0.929412 0.752941 0.925754 0.719589
4 ALBERT 0.729412 0.800000 0.615286 0.711111
Insights:¶
  • BERT appears to be having the best Test Accuracy(80%) and Test F1 Score(74.80%). However, it shows a significant amount of overfitting because there is around 15% difference in train and test metrics.
  • RoBERTa model is giving similar Test Accuracy(80%) and around 73% of Test F1 score. Moreover, there is no overfitting beause the train and test metrics are almost the same.
  • RoBERTa becomes the natural choice as the best model as it is generalizing well and giving comparable results to BERT model.

Below are the few reason why RoBERTa is chosen as the best model and how it is better than BERT:

  • RoBERTa was trained on significantly more data than BERT (160GB for RoBERTa vs. 16GB for BERT).
  • Access to more training data allows RoBERTa to better generalize patterns in natural language, making it more robust across various domains and also helps to avoid overfitting.
  • BERT uses static masking, where the same tokens are masked during every epoch of training.
  • RoBERTa employs dynamic masking, where tokens are masked differently for each epoch. This leads to a better understanding of the language structure and improves the model's learning capacity.
  • RoBERTa's pretraining helps it generalize across underrepresented classes, making it more suitable for imbalanced datasets where classes like "Medium" or "High" severity might have fewer samples.

The improvement in accuracy and F1 score after merging accident levels into three severity levels ("Low," "Medium," and "High") is primarily due to a combination of factors related to class distribution, simplification of the classification problem, and data representation:

1. Simplification of the Classification Problem¶
  • Reduced Complexity: By reducing the classification from five to three classes, the model faces a simpler classification problem. Fewer classes mean the model has fewer distinctions to learn, making it easier to generalize patterns in the data.
  • Clearer Boundaries: When fewer classes are present, the decision boundaries for each class become more defined. This helps the model to classify more accurately because it no longer needs to distinguish between similar classes (e.g., levels II and III or IV and V).
  • Less Class Overlap: In many cases, classes that are merged (like levels II and III, or IV and V) have similar characteristics in the input data. This reduction can minimize overlapping features that the model might previously have confused, leading to more reliable predictions.
2. Better Class Distribution¶
  • Balanced Representation: In multi-class classification, imbalanced classes can make training difficult for the model, as it tends to focus on the majority classes. By merging levels, you've effectively reduced some of the imbalance, giving the model a more balanced dataset, which can result in better performance.
  • Improved Sample Size per Class: Merging classes also increases the number of samples in each class. Larger sample sizes per class allow the model to learn more representative patterns for each severity level, enhancing generalization and accuracy.
3. Enhanced Metrics Calculation¶
  • Accuracy and F1-Score Sensitivity: With fewer classes and a more balanced dataset, both accuracy and F1-score metrics typically become more stable and meaningful. F1 score, especially, is sensitive to class imbalances and benefits from the reduction in complexity and increase in per-class sample size, providing a better assessment of model performance.
How This Change Impacts Performance:¶

The merging of classes effectively addresses issues of class imbalance and improves the dataset’s representation across classes. As a result:

  • Reduced Overfitting: The model is less likely to overfit on minority classes due to the increased sample sizes and simplified classification task.
  • Improved Generalization: A simpler classification task and more balanced data distribution mean the model is more likely to generalize well to new data, reflected in improved performance metrics like accuracy and F1 score.

Limitations¶

  • Loss of Granularity: Although performance improves, merging classes sacrifices some detail (i.e., the nuanced difference between adjacent accident levels). Depending on the context, this loss of detail may or may not be acceptable.

By reducing the complexity of the target variable, we have optimized the model's learning process, resulting in a more reliable, accurate classification model.

Balancing data using SMOTE and Running the models¶

In [ ]:
# Install required libraries
!pip install transformers datasets torch scikit-learn imbalanced-learn

from transformers import (
    BertTokenizer, BertForSequenceClassification,
    RobertaTokenizer, RobertaForSequenceClassification,
    DistilBertTokenizer, DistilBertForSequenceClassification,
    XLNetTokenizer, XLNetForSequenceClassification,
    AlbertTokenizer, AlbertForSequenceClassification,
    Trainer, TrainingArguments, EarlyStoppingCallback
)
from datasets import Dataset
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import warnings
import torch

# Suppress all warnings
warnings.filterwarnings("ignore")

# Load data
df = data_original.copy()
df = df[['Description', 'Accident Level']]  # Columns with text and labels

# Map each accident level to the corresponding severity level
def map_severity(level):
    if level == 'I':
        return 'Low'
    elif level in ['II', 'III']:
        return 'Medium'
    elif level in ['IV', 'V']:
        return 'High'

# Apply the mapping function to create a new column
df['Accident Severity'] = df['Accident Level'].apply(map_severity)

# Encode labels to numeric format
label_encoder = LabelEncoder()
df['labels'] = label_encoder.fit_transform(df['Accident Severity'])

# Split data into training and testing sets
train_texts, test_texts, train_labels, test_labels = train_test_split(
    df['Description'].tolist(),
    df['labels'].tolist(),
    test_size=0.2,
    random_state=42,
    stratify=df['labels']  # Ensures proportional representation in splits
)

# Initialize SMOTE
smote = SMOTE(random_state=42)

# Step 1: Transform text data using TF-IDF
tfidf_vectorizer = TfidfVectorizer(max_features=1000)  # Adjust max_features as needed
train_tfidf = tfidf_vectorizer.fit_transform(train_texts).toarray()
test_tfidf = tfidf_vectorizer.transform(test_texts).toarray()

# Step 2: Apply SMOTE
smote = SMOTE(random_state=42)
train_tfidf_resampled, train_labels_resampled = smote.fit_resample(train_tfidf, train_labels)

# Step 3: Map back to original text format if required
# Convert TF-IDF back into readable data, if needed (e.g., to debug or understand results)
# Here, we'll directly use `train_tfidf_resampled` and `train_labels_resampled` for model training

# Optional: Verify the class distribution after oversampling
from collections import Counter
print("Class distribution after SMOTE:", Counter(train_labels_resampled))

# Function to compute metrics
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    accuracy = accuracy_score(labels, preds)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='weighted')
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1
    }

# Function to generate a classification report
def get_classification_report(labels, preds):
    return classification_report(labels, preds, output_dict=True)

# Training function for each model
def train_and_evaluate_model(model_pretrained, tokenizer_class, model_class, train_texts, train_labels, test_texts, test_labels):
    tokenizer = tokenizer_class.from_pretrained(model_pretrained)
    model = model_class.from_pretrained(model_pretrained, num_labels=len(set(train_labels)))

    # Ensure all model parameters are contiguous
    for param in model.parameters():
        if not param.is_contiguous():
            param.data = param.data.contiguous()

    # Tokenize the datasets
    train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=512)
    test_encodings = tokenizer(test_texts, truncation=True, padding=True, max_length=512)

    # Convert tokenized data to Dataset format
    train_dataset = Dataset.from_dict({
        "input_ids": train_encodings["input_ids"],
        "attention_mask": train_encodings["attention_mask"],
        "labels": train_labels
    })
    test_dataset = Dataset.from_dict({
        "input_ids": test_encodings["input_ids"],
        "attention_mask": test_encodings["attention_mask"],
        "labels": test_labels
    })

    # Define training arguments
    training_args = TrainingArguments(
        output_dir='./results',
        num_train_epochs=10,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        warmup_steps=500,
        weight_decay=0.01,
        logging_dir='./logs',
        logging_steps=10,
        evaluation_strategy="epoch",
        load_best_model_at_end=True,
        save_strategy="epoch",
        metric_for_best_model="f1",
        greater_is_better=True,
    )

    # Define Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=test_dataset,
        compute_metrics=compute_metrics
    )

    # Train and evaluate
    trainer.train()

    # Get final predictions and generate classification reports
    train_preds = trainer.predict(train_dataset)
    test_preds = trainer.predict(test_dataset)

    train_report = get_classification_report(train_preds.label_ids, train_preds.predictions.argmax(-1))
    test_report = get_classification_report(test_preds.label_ids, test_preds.predictions.argmax(-1))

    # Get accuracy and f1 metrics for summary table
    train_accuracy = train_report["accuracy"]
    test_accuracy = test_report["accuracy"]
    train_f1 = train_report["weighted avg"]["f1-score"]
    test_f1 = test_report["weighted avg"]["f1-score"]

    return train_report, test_report, train_accuracy, test_accuracy, train_f1, test_f1

# Define models and tokenizers
models = {
    "BERT": (BertTokenizer, BertForSequenceClassification, 'bert-base-uncased'),
    "RoBERTa": (RobertaTokenizer, RobertaForSequenceClassification, 'roberta-base'),
    "DistilBERT": (DistilBertTokenizer, DistilBertForSequenceClassification, 'distilbert-base-uncased'),
    "XLNet": (XLNetTokenizer, XLNetForSequenceClassification, 'xlnet-base-cased'),
    "ALBERT": (AlbertTokenizer, AlbertForSequenceClassification, 'albert-base-v2')
}

# Initialize a list to store results for each model
model_results = []

# Loop through models
for model_name, (tokenizer_class, model_class, model_pretrained) in models.items():
    print(f"Training {model_name}...")
    train_report, test_report, train_accuracy, test_accuracy, train_f1, test_f1 = train_and_evaluate_model(
        model_pretrained, tokenizer_class, model_class, train_texts, train_labels, test_texts, test_labels
    )

    # Append the best epoch's classification report and metrics for each model to model_results list
    model_results.append({
        "Model": model_name,
        "Train Accuracy": train_accuracy,
        "Test Accuracy": test_accuracy,
        "Train F1": train_f1,
        "Test F1": test_f1,
        "Train Classification Report": train_report,
        "Test Classification Report": test_report
    })

    print(f"Finished training {model_name}.\n")
    print(f"Train Classification Report for {model_name}:\n", pd.DataFrame(train_report).T)
    print(f"Test Classification Report for {model_name}:\n", pd.DataFrame(test_report).T)

# Convert results to DataFrame for summary table
summary_df = pd.DataFrame(model_results)[["Model", "Train Accuracy", "Test Accuracy", "Train F1", "Test F1"]]

# Display summary tables ordered by Test Accuracy and Test F1
print("\nSummary Table - Ordered by Test Accuracy (Descending):")
print(summary_df.sort_values(by="Test Accuracy", ascending=False).to_string(index=False))

print("\nSummary Table - Ordered by Test F1 Score (Descending):")
print(summary_df.sort_values(by="Test F1", ascending=False).to_string(index=False))
Requirement already satisfied: transformers in /usr/local/lib/python3.10/dist-packages (4.46.2)
Requirement already satisfied: datasets in /usr/local/lib/python3.10/dist-packages (3.1.0)
Requirement already satisfied: torch in /usr/local/lib/python3.10/dist-packages (2.5.0+cu121)
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.10/dist-packages (1.5.2)
Requirement already satisfied: imbalanced-learn in /usr/local/lib/python3.10/dist-packages (0.12.4)
Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from transformers) (3.16.1)
Requirement already satisfied: huggingface-hub<1.0,>=0.23.2 in /usr/local/lib/python3.10/dist-packages (from transformers) (0.26.2)
Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.10/dist-packages (from transformers) (1.26.4)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from transformers) (24.2)
Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from transformers) (6.0.2)
Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.10/dist-packages (from transformers) (2024.9.11)
Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from transformers) (2.32.3)
Requirement already satisfied: safetensors>=0.4.1 in /usr/local/lib/python3.10/dist-packages (from transformers) (0.4.5)
Requirement already satisfied: tokenizers<0.21,>=0.20 in /usr/local/lib/python3.10/dist-packages (from transformers) (0.20.3)
Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.10/dist-packages (from transformers) (4.66.6)
Requirement already satisfied: pyarrow>=15.0.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (17.0.0)
Requirement already satisfied: dill<0.3.9,>=0.3.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (0.3.8)
Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from datasets) (2.2.2)
Requirement already satisfied: xxhash in /usr/local/lib/python3.10/dist-packages (from datasets) (3.5.0)
Requirement already satisfied: multiprocess<0.70.17 in /usr/local/lib/python3.10/dist-packages (from datasets) (0.70.16)
Requirement already satisfied: fsspec<=2024.9.0,>=2023.1.0 in /usr/local/lib/python3.10/dist-packages (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets) (2024.9.0)
Requirement already satisfied: aiohttp in /usr/local/lib/python3.10/dist-packages (from datasets) (3.10.10)
Requirement already satisfied: typing-extensions>=4.8.0 in /usr/local/lib/python3.10/dist-packages (from torch) (4.12.2)
Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch) (3.4.2)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch) (3.1.4)
Requirement already satisfied: sympy==1.13.1 in /usr/local/lib/python3.10/dist-packages (from torch) (1.13.1)
Requirement already satisfied: mpmath<1.4,>=1.1.0 in /usr/local/lib/python3.10/dist-packages (from sympy==1.13.1->torch) (1.3.0)
Requirement already satisfied: scipy>=1.6.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn) (1.13.1)
Requirement already satisfied: joblib>=1.2.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn) (1.4.2)
Requirement already satisfied: threadpoolctl>=3.1.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn) (3.5.0)
Requirement already satisfied: aiohappyeyeballs>=2.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (2.4.3)
Requirement already satisfied: aiosignal>=1.1.2 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.3.1)
Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (24.2.0)
Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.5.0)
Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (6.1.0)
Requirement already satisfied: yarl<2.0,>=1.12.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.17.1)
Requirement already satisfied: async-timeout<5.0,>=4.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (4.0.3)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (3.4.0)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (3.10)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (2.2.3)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (2024.8.30)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch) (3.0.2)
Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets) (2024.2)
Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets) (2024.2)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.2->pandas->datasets) (1.16.0)
Requirement already satisfied: propcache>=0.2.0 in /usr/local/lib/python3.10/dist-packages (from yarl<2.0,>=1.12.0->aiohttp->datasets) (0.2.0)
Class distribution after SMOTE: Counter({1: 253, 0: 253, 2: 253})
Training BERT...
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[430/430 01:45, Epoch 10/10]
Epoch Training Loss Validation Loss Accuracy Precision Recall F1
1 0.816100 0.803703 0.741176 0.549343 0.741176 0.631002
2 0.732700 0.748034 0.741176 0.549343 0.741176 0.631002
3 0.686600 0.743167 0.741176 0.549343 0.741176 0.631002
4 0.618300 0.721940 0.741176 0.549343 0.741176 0.631002
5 0.634400 0.753738 0.729412 0.606265 0.729412 0.642988
6 0.454700 0.868172 0.752941 0.670206 0.752941 0.672588
7 0.395800 0.916208 0.635294 0.644522 0.635294 0.637331
8 0.139900 1.121634 0.635294 0.632941 0.635294 0.632157
9 0.142800 1.560666 0.647059 0.621390 0.647059 0.631373
10 0.074200 1.769029 0.729412 0.649193 0.729412 0.662836

Finished training BERT.

Train Classification Report for BERT:
               precision    recall  f1-score     support
0              1.000000  0.100000  0.181818   30.000000
1              0.897163  1.000000  0.945794  253.000000
2              0.836364  0.807018  0.821429   57.000000
accuracy       0.888235  0.888235  0.888235    0.888235
macro avg      0.911176  0.635673  0.649680  340.000000
weighted avg   0.896044  0.888235  0.857535  340.000000
Test Classification Report for BERT:
               precision    recall  f1-score    support
0              0.000000  0.000000  0.000000   8.000000
1              0.756098  0.984127  0.855172  63.000000
2              0.666667  0.142857  0.235294  14.000000
accuracy       0.752941  0.752941  0.752941   0.752941
macro avg      0.474255  0.375661  0.363489  85.000000
weighted avg   0.670206  0.752941  0.672588  85.000000
Training RoBERTa...
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[430/430 01:52, Epoch 10/10]
Epoch Training Loss Validation Loss Accuracy Precision Recall F1
1 1.106400 1.041591 0.741176 0.549343 0.741176 0.631002
2 0.777200 0.777403 0.741176 0.549343 0.741176 0.631002
3 0.723000 0.807094 0.741176 0.549343 0.741176 0.631002
4 0.631600 0.781504 0.741176 0.549343 0.741176 0.631002
5 0.832200 0.798861 0.741176 0.549343 0.741176 0.631002
6 0.490500 0.818245 0.729412 0.700276 0.729412 0.701663
7 0.366500 1.227772 0.705882 0.634804 0.705882 0.667270
8 0.400700 1.308030 0.705882 0.681130 0.705882 0.690684
9 0.478600 1.740013 0.741176 0.636003 0.741176 0.650081
10 0.206900 1.676201 0.717647 0.633183 0.717647 0.655688

Finished training RoBERTa.

Train Classification Report for RoBERTa:
               precision    recall  f1-score     support
0              0.850000  0.566667  0.680000   30.000000
1              0.968750  0.980237  0.974460  253.000000
2              0.750000  0.842105  0.793388   57.000000
accuracy       0.920588  0.920588  0.920588    0.920588
macro avg      0.856250  0.796336  0.815949  340.000000
weighted avg   0.921599  0.920588  0.918122  340.000000
Test Classification Report for RoBERTa:
               precision    recall  f1-score    support
0              0.500000  0.125000  0.200000   8.000000
1              0.788732  0.888889  0.835821  63.000000
2              0.416667  0.357143  0.384615  14.000000
accuracy       0.729412  0.729412  0.729412   0.729412
macro avg      0.568466  0.457011  0.473479  85.000000
weighted avg   0.700276  0.729412  0.701663  85.000000
Training DistilBERT...
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[430/430 00:57, Epoch 10/10]
Epoch Training Loss Validation Loss Accuracy Precision Recall F1
1 1.004200 0.942767 0.741176 0.549343 0.741176 0.631002
2 0.730900 0.741981 0.741176 0.549343 0.741176 0.631002
3 0.727700 0.742686 0.741176 0.549343 0.741176 0.631002
4 0.657800 0.730377 0.741176 0.549343 0.741176 0.631002
5 0.708300 0.724185 0.741176 0.549343 0.741176 0.631002
6 0.462600 0.746471 0.788235 0.811127 0.788235 0.728466
7 0.201400 0.864243 0.717647 0.643765 0.717647 0.669315
8 0.205500 1.136860 0.682353 0.644954 0.682353 0.662099
9 0.199700 1.394603 0.741176 0.645272 0.741176 0.675758
10 0.026900 1.639899 0.694118 0.619534 0.694118 0.647528

Finished training DistilBERT.

Train Classification Report for DistilBERT:
               precision    recall  f1-score     support
0              0.806452  0.833333  0.819672   30.000000
1              0.965385  0.992095  0.978558  253.000000
2              0.938776  0.807018  0.867925   57.000000
accuracy       0.947059  0.947059  0.947059    0.947059
macro avg      0.903537  0.877482  0.888718  340.000000
weighted avg   0.946900  0.947059  0.945991  340.000000
Test Classification Report for DistilBERT:
               precision    recall  f1-score    support
0              0.666667  0.250000  0.363636   8.000000
1              0.787500  1.000000  0.881119  63.000000
2              1.000000  0.142857  0.250000  14.000000
accuracy       0.788235  0.788235  0.788235   0.788235
macro avg      0.818056  0.464286  0.498252  85.000000
weighted avg   0.811127  0.788235  0.728466  85.000000
Training XLNet...
Some weights of XLNetForSequenceClassification were not initialized from the model checkpoint at xlnet-base-cased and are newly initialized: ['logits_proj.bias', 'logits_proj.weight', 'sequence_summary.summary.bias', 'sequence_summary.summary.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[430/430 02:31, Epoch 10/10]
Epoch Training Loss Validation Loss Accuracy Precision Recall F1
1 0.722600 0.761581 0.741176 0.549343 0.741176 0.631002
2 0.737200 0.733592 0.741176 0.549343 0.741176 0.631002
3 0.681000 0.745367 0.741176 0.549343 0.741176 0.631002
4 0.547000 0.833539 0.741176 0.549343 0.741176 0.631002
5 0.748700 0.844072 0.741176 0.555882 0.741176 0.635294
6 0.557600 0.827177 0.741176 0.645272 0.741176 0.675758
7 0.490700 0.999443 0.752941 0.613055 0.752941 0.674115
8 0.267400 1.041874 0.717647 0.558170 0.717647 0.627941
9 0.306800 1.019170 0.717647 0.656401 0.717647 0.685430
10 0.221500 1.149069 0.600000 0.656092 0.600000 0.619913

Finished training XLNet.

Train Classification Report for XLNet:
               precision    recall  f1-score     support
0              0.916667  0.733333  0.814815   30.000000
1              0.996047  0.996047  0.996047  253.000000
2              0.873016  0.964912  0.916667   57.000000
accuracy       0.967647  0.967647  0.967647    0.967647
macro avg      0.928577  0.898098  0.909176  340.000000
weighted avg   0.968417  0.967647  0.966748  340.000000
Test Classification Report for XLNet:
               precision    recall  f1-score    support
0              0.000000  0.000000  0.000000   8.000000
1              0.794118  0.857143  0.824427  63.000000
2              0.411765  0.500000  0.451613  14.000000
accuracy       0.717647  0.717647  0.717647   0.717647
macro avg      0.401961  0.452381  0.425347  85.000000
weighted avg   0.656401  0.717647  0.685430  85.000000
Training ALBERT...
Some weights of AlbertForSequenceClassification were not initialized from the model checkpoint at albert-base-v2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[430/430 01:17, Epoch 10/10]
Epoch Training Loss Validation Loss Accuracy Precision Recall F1
1 0.735000 0.740522 0.741176 0.549343 0.741176 0.631002
2 0.704500 0.720767 0.741176 0.549343 0.741176 0.631002
3 0.658700 0.799433 0.741176 0.549343 0.741176 0.631002
4 0.641500 0.738424 0.741176 0.549343 0.741176 0.631002
5 0.710900 0.794459 0.741176 0.549343 0.741176 0.631002
6 0.556000 0.842303 0.705882 0.632974 0.705882 0.667252
7 0.449200 0.891095 0.658824 0.652738 0.658824 0.650591
8 0.300700 1.006667 0.623529 0.630531 0.623529 0.619780
9 0.429300 0.904058 0.611765 0.722003 0.611765 0.626022
10 0.579400 0.938420 0.670588 0.595798 0.670588 0.630115

Finished training ALBERT.

Train Classification Report for ALBERT:
               precision    recall  f1-score     support
0              0.000000  0.000000  0.000000   30.000000
1              0.949612  0.968379  0.958904  253.000000
2              0.597561  0.859649  0.705036   57.000000
accuracy       0.864706  0.864706  0.864706    0.864706
macro avg      0.515724  0.609343  0.554647  340.000000
weighted avg   0.806803  0.864706  0.831735  340.000000
Test Classification Report for ALBERT:
               precision    recall  f1-score    support
0              0.000000  0.000000  0.000000   8.000000
1              0.774648  0.873016  0.820896  63.000000
2              0.357143  0.357143  0.357143  14.000000
accuracy       0.705882  0.705882  0.705882   0.705882
macro avg      0.377264  0.410053  0.392679  85.000000
weighted avg   0.632974  0.705882  0.667252  85.000000

Summary Table - Ordered by Test Accuracy (Descending):
     Model  Train Accuracy  Test Accuracy  Train F1  Test F1
DistilBERT        0.947059       0.788235  0.945991 0.728466
      BERT        0.888235       0.752941  0.857535 0.672588
   RoBERTa        0.920588       0.729412  0.918122 0.701663
     XLNet        0.967647       0.717647  0.966748 0.685430
    ALBERT        0.864706       0.705882  0.831735 0.667252

Summary Table - Ordered by Test F1 Score (Descending):
     Model  Train Accuracy  Test Accuracy  Train F1  Test F1
DistilBERT        0.947059       0.788235  0.945991 0.728466
   RoBERTa        0.920588       0.729412  0.918122 0.701663
     XLNet        0.967647       0.717647  0.966748 0.685430
      BERT        0.888235       0.752941  0.857535 0.672588
    ALBERT        0.864706       0.705882  0.831735 0.667252
Insights:¶
  • DistilBERT is giving best results for both test accuracy (78.82%) and Test F1 score(72.85%).
  • After balancing the data Test accuracy of best model reduced a bit from 80% to 78.82% and f1 score reduced from 73% to 72.84%.
  • We are getting decent precision, recall and f1 score for minority classes as shown in the classification report.
  • We will now check the performance of the models after using Random over sampler.

Balancing data using Random Oversampler and Running the models¶

In [ ]:
# Install required libraries
!pip install transformers datasets torch scikit-learn imbalanced-learn

from transformers import (
    BertTokenizer, BertForSequenceClassification,
    RobertaTokenizer, RobertaForSequenceClassification,
    DistilBertTokenizer, DistilBertForSequenceClassification,
    XLNetTokenizer, XLNetForSequenceClassification,
    AlbertTokenizer, AlbertForSequenceClassification,
    Trainer, TrainingArguments, EarlyStoppingCallback
)
from datasets import Dataset
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler
import pandas as pd
import warnings
import torch

# Suppress all warnings
warnings.filterwarnings("ignore")

# Load data
df = data_original.copy()
df = df[['Description', 'Accident Level']]  # Columns with text and labels

# Map each accident level to the corresponding severity level
def map_severity(level):
    if level == 'I':
        return 'Low'
    elif level in ['II', 'III']:
        return 'Medium'
    elif level in ['IV', 'V']:
        return 'High'

# Apply the mapping function to create a new column
df['Accident Severity'] = df['Accident Level'].apply(map_severity)

# Encode labels to numeric format
label_encoder = LabelEncoder()
df['labels'] = label_encoder.fit_transform(df['Accident Severity'])

# Split data into training and testing sets
train_texts, test_texts, train_labels, test_labels = train_test_split(
    df['Description'].tolist(),
    df['labels'].tolist(),
    test_size=0.2,
    random_state=42,
    stratify=df['labels']  # Ensures proportional representation in splits
)

# Initialize RandomOverSampler -----------------------------------------------------------
ros = RandomOverSampler(random_state=42)

# Reshape train_texts for oversampling (required by RandomOverSampler)
train_texts_df = pd.DataFrame(train_texts, columns=['Description'])

# Apply RandomOverSampler to training data
train_texts_resampled, train_labels_resampled = ros.fit_resample(train_texts_df, train_labels)

# Convert back to lists
train_texts_resampled = train_texts_resampled['Description'].tolist()

# Optional: Verify the class distribution after oversampling
from collections import Counter
print("Class distribution after oversampling:", Counter(train_labels_resampled))

# Function to compute metrics
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    accuracy = accuracy_score(labels, preds)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='weighted')
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1
    }

# Function to generate a classification report
def get_classification_report(labels, preds):
    return classification_report(labels, preds, output_dict=True)

# Training function for each model
def train_and_evaluate_model(model_pretrained, tokenizer_class, model_class, train_texts, train_labels, test_texts, test_labels):
    tokenizer = tokenizer_class.from_pretrained(model_pretrained)
    model = model_class.from_pretrained(model_pretrained, num_labels=len(set(train_labels)))

    # Ensure all model parameters are contiguous
    for param in model.parameters():
        if not param.is_contiguous():
            param.data = param.data.contiguous()

    # Tokenize the datasets
    train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=512)
    test_encodings = tokenizer(test_texts, truncation=True, padding=True, max_length=512)

    # Convert tokenized data to Dataset format
    train_dataset = Dataset.from_dict({
        "input_ids": train_encodings["input_ids"],
        "attention_mask": train_encodings["attention_mask"],
        "labels": train_labels
    })
    test_dataset = Dataset.from_dict({
        "input_ids": test_encodings["input_ids"],
        "attention_mask": test_encodings["attention_mask"],
        "labels": test_labels
    })

    # Define training arguments
    training_args = TrainingArguments(
        output_dir='./results',
        num_train_epochs=10,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        warmup_steps=500,
        weight_decay=0.01,
        logging_dir='./logs',
        logging_steps=10,
        evaluation_strategy="epoch",
        load_best_model_at_end=True,
        save_strategy="epoch",
        metric_for_best_model="f1",
        greater_is_better=True,
    )

    # Define Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=test_dataset,
        compute_metrics=compute_metrics
    )

    # Train and evaluate
    trainer.train()

    # Get final predictions and generate classification reports
    train_preds = trainer.predict(train_dataset)
    test_preds = trainer.predict(test_dataset)

    train_report = get_classification_report(train_preds.label_ids, train_preds.predictions.argmax(-1))
    test_report = get_classification_report(test_preds.label_ids, test_preds.predictions.argmax(-1))

    # Get accuracy and f1 metrics for summary table
    train_accuracy = train_report["accuracy"]
    test_accuracy = test_report["accuracy"]
    train_f1 = train_report["weighted avg"]["f1-score"]
    test_f1 = test_report["weighted avg"]["f1-score"]

    return train_report, test_report, train_accuracy, test_accuracy, train_f1, test_f1

# Define models and tokenizers
models = {
    "BERT": (BertTokenizer, BertForSequenceClassification, 'bert-base-uncased'),
    "RoBERTa": (RobertaTokenizer, RobertaForSequenceClassification, 'roberta-base'),
    "DistilBERT": (DistilBertTokenizer, DistilBertForSequenceClassification, 'distilbert-base-uncased'),
    "XLNet": (XLNetTokenizer, XLNetForSequenceClassification, 'xlnet-base-cased'),
    "ALBERT": (AlbertTokenizer, AlbertForSequenceClassification, 'albert-base-v2')
}

# Initialize a list to store results for each model
model_results = []

# Loop through models
for model_name, (tokenizer_class, model_class, model_pretrained) in models.items():
    print(f"Training {model_name}...")
    train_report, test_report, train_accuracy, test_accuracy, train_f1, test_f1 = train_and_evaluate_model(
        model_pretrained, tokenizer_class, model_class, train_texts, train_labels, test_texts, test_labels
    )

    # Append the best epoch's classification report and metrics for each model to model_results list
    model_results.append({
        "Model": model_name,
        "Train Accuracy": train_accuracy,
        "Test Accuracy": test_accuracy,
        "Train F1": train_f1,
        "Test F1": test_f1,
        "Train Classification Report": train_report,
        "Test Classification Report": test_report
    })

    print(f"Finished training {model_name}.\n")
    print(f"Train Classification Report for {model_name}:\n", pd.DataFrame(train_report).T)
    print(f"Test Classification Report for {model_name}:\n", pd.DataFrame(test_report).T)

# Convert results to DataFrame for summary table
summary_df = pd.DataFrame(model_results)[["Model", "Train Accuracy", "Test Accuracy", "Train F1", "Test F1"]]

# Display summary tables ordered by Test Accuracy and Test F1
print("\nSummary Table - Ordered by Test Accuracy (Descending):")
print(summary_df.sort_values(by="Test Accuracy", ascending=False).to_string(index=False))

print("\nSummary Table - Ordered by Test F1 Score (Descending):")
print(summary_df.sort_values(by="Test F1", ascending=False).to_string(index=False))
Requirement already satisfied: transformers in /usr/local/lib/python3.10/dist-packages (4.46.2)
Requirement already satisfied: datasets in /usr/local/lib/python3.10/dist-packages (3.1.0)
Requirement already satisfied: torch in /usr/local/lib/python3.10/dist-packages (2.5.0+cu121)
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.10/dist-packages (1.5.2)
Requirement already satisfied: imbalanced-learn in /usr/local/lib/python3.10/dist-packages (0.12.4)
Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from transformers) (3.16.1)
Requirement already satisfied: huggingface-hub<1.0,>=0.23.2 in /usr/local/lib/python3.10/dist-packages (from transformers) (0.26.2)
Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.10/dist-packages (from transformers) (1.26.4)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from transformers) (24.2)
Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from transformers) (6.0.2)
Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.10/dist-packages (from transformers) (2024.9.11)
Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from transformers) (2.32.3)
Requirement already satisfied: safetensors>=0.4.1 in /usr/local/lib/python3.10/dist-packages (from transformers) (0.4.5)
Requirement already satisfied: tokenizers<0.21,>=0.20 in /usr/local/lib/python3.10/dist-packages (from transformers) (0.20.3)
Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.10/dist-packages (from transformers) (4.66.6)
Requirement already satisfied: pyarrow>=15.0.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (17.0.0)
Requirement already satisfied: dill<0.3.9,>=0.3.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (0.3.8)
Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from datasets) (2.2.2)
Requirement already satisfied: xxhash in /usr/local/lib/python3.10/dist-packages (from datasets) (3.5.0)
Requirement already satisfied: multiprocess<0.70.17 in /usr/local/lib/python3.10/dist-packages (from datasets) (0.70.16)
Requirement already satisfied: fsspec<=2024.9.0,>=2023.1.0 in /usr/local/lib/python3.10/dist-packages (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets) (2024.9.0)
Requirement already satisfied: aiohttp in /usr/local/lib/python3.10/dist-packages (from datasets) (3.10.10)
Requirement already satisfied: typing-extensions>=4.8.0 in /usr/local/lib/python3.10/dist-packages (from torch) (4.12.2)
Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch) (3.4.2)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch) (3.1.4)
Requirement already satisfied: sympy==1.13.1 in /usr/local/lib/python3.10/dist-packages (from torch) (1.13.1)
Requirement already satisfied: mpmath<1.4,>=1.1.0 in /usr/local/lib/python3.10/dist-packages (from sympy==1.13.1->torch) (1.3.0)
Requirement already satisfied: scipy>=1.6.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn) (1.13.1)
Requirement already satisfied: joblib>=1.2.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn) (1.4.2)
Requirement already satisfied: threadpoolctl>=3.1.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn) (3.5.0)
Requirement already satisfied: aiohappyeyeballs>=2.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (2.4.3)
Requirement already satisfied: aiosignal>=1.1.2 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.3.1)
Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (24.2.0)
Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.5.0)
Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (6.1.0)
Requirement already satisfied: yarl<2.0,>=1.12.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.17.1)
Requirement already satisfied: async-timeout<5.0,>=4.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (4.0.3)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (3.4.0)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (3.10)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (2.2.3)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (2024.8.30)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch) (3.0.2)
Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets) (2024.2)
Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets) (2024.2)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.2->pandas->datasets) (1.16.0)
Requirement already satisfied: propcache>=0.2.0 in /usr/local/lib/python3.10/dist-packages (from yarl<2.0,>=1.12.0->aiohttp->datasets) (0.2.0)
Class distribution after oversampling: Counter({1: 253, 0: 253, 2: 253})
Training BERT...
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[430/430 01:55, Epoch 10/10]
Epoch Training Loss Validation Loss Accuracy Precision Recall F1
1 0.956700 0.877320 0.741176 0.549343 0.741176 0.631002
2 0.743400 0.737356 0.741176 0.549343 0.741176 0.631002
3 0.670900 0.732263 0.741176 0.549343 0.741176 0.631002
4 0.584800 0.750003 0.741176 0.549343 0.741176 0.631002
5 0.736100 0.854274 0.611765 0.619195 0.611765 0.606275
6 0.387800 0.881079 0.741176 0.636003 0.741176 0.650081
7 0.174900 1.079640 0.694118 0.636471 0.694118 0.659328
8 0.277300 1.566256 0.729412 0.583971 0.729412 0.646812
9 0.043600 1.720153 0.682353 0.651947 0.682353 0.654199
10 0.015900 2.148781 0.717647 0.544720 0.717647 0.619339

Finished training BERT.

Train Classification Report for BERT:
               precision    recall  f1-score     support
0              0.933333  0.933333  0.933333   30.000000
1              0.992157  1.000000  0.996063  253.000000
2              0.981818  0.947368  0.964286   57.000000
accuracy       0.985294  0.985294  0.985294    0.985294
macro avg      0.969103  0.960234  0.964561  340.000000
weighted avg   0.985233  0.985294  0.985201  340.000000
Test Classification Report for BERT:
               precision    recall  f1-score    support
0              0.200000  0.125000  0.153846   8.000000
1              0.777778  0.888889  0.829630  63.000000
2              0.250000  0.142857  0.181818  14.000000
accuracy       0.694118  0.694118  0.694118   0.694118
macro avg      0.409259  0.385582  0.388431  85.000000
weighted avg   0.636471  0.694118  0.659328  85.000000
Training RoBERTa...
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[430/430 02:08, Epoch 10/10]
Epoch Training Loss Validation Loss Accuracy Precision Recall F1
1 1.106400 1.041592 0.741176 0.549343 0.741176 0.631002
2 0.772900 0.774362 0.741176 0.549343 0.741176 0.631002
3 0.724100 0.811970 0.741176 0.549343 0.741176 0.631002
4 0.634300 0.792809 0.741176 0.549343 0.741176 0.631002
5 0.846400 0.727925 0.741176 0.549343 0.741176 0.631002
6 0.529000 1.067424 0.741176 0.549343 0.741176 0.631002
7 0.331800 1.149616 0.717647 0.603922 0.717647 0.637024
8 0.303300 1.081576 0.788235 0.769804 0.788235 0.767828
9 0.348400 2.025714 0.564706 0.684202 0.564706 0.601261
10 0.220700 1.682578 0.741176 0.636003 0.741176 0.650081

Finished training RoBERTa.

Train Classification Report for RoBERTa:
               precision    recall  f1-score     support
0              0.958333  0.766667  0.851852   30.000000
1              0.988189  0.992095  0.990138  253.000000
2              0.870968  0.947368  0.907563   57.000000
accuracy       0.964706  0.964706  0.964706    0.964706
macro avg      0.939163  0.902043  0.916518  340.000000
weighted avg   0.965903  0.964706  0.964093  340.000000
Test Classification Report for RoBERTa:
               precision    recall  f1-score    support
0              0.666667  0.250000  0.363636   8.000000
1              0.842857  0.936508  0.887218  63.000000
2              0.500000  0.428571  0.461538  14.000000
accuracy       0.788235  0.788235  0.788235   0.788235
macro avg      0.669841  0.538360  0.570798  85.000000
weighted avg   0.769804  0.788235  0.767828  85.000000
Training DistilBERT...
tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]
vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]
tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]
config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]
model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[430/430 01:02, Epoch 10/10]
Epoch Training Loss Validation Loss Accuracy Precision Recall F1
1 1.004200 0.942767 0.741176 0.549343 0.741176 0.631002
2 0.730900 0.741981 0.741176 0.549343 0.741176 0.631002
3 0.727700 0.742686 0.741176 0.549343 0.741176 0.631002
4 0.657800 0.730377 0.741176 0.549343 0.741176 0.631002
5 0.708300 0.724185 0.741176 0.549343 0.741176 0.631002
6 0.462600 0.746471 0.788235 0.811127 0.788235 0.728466
7 0.201400 0.864323 0.717647 0.643765 0.717647 0.669315
8 0.205000 1.137261 0.682353 0.644954 0.682353 0.662099
9 0.199200 1.399701 0.741176 0.645272 0.741176 0.675758
10 0.022400 1.669772 0.705882 0.624941 0.705882 0.654835

Finished training DistilBERT.

Train Classification Report for DistilBERT:
               precision    recall  f1-score     support
0              0.806452  0.833333  0.819672   30.000000
1              0.965385  0.992095  0.978558  253.000000
2              0.938776  0.807018  0.867925   57.000000
accuracy       0.947059  0.947059  0.947059    0.947059
macro avg      0.903537  0.877482  0.888718  340.000000
weighted avg   0.946900  0.947059  0.945991  340.000000
Test Classification Report for DistilBERT:
               precision    recall  f1-score    support
0              0.666667  0.250000  0.363636   8.000000
1              0.787500  1.000000  0.881119  63.000000
2              1.000000  0.142857  0.250000  14.000000
accuracy       0.788235  0.788235  0.788235   0.788235
macro avg      0.818056  0.464286  0.498252  85.000000
weighted avg   0.811127  0.788235  0.728466  85.000000
Training XLNet...
spiece.model:   0%|          | 0.00/798k [00:00<?, ?B/s]
tokenizer.json:   0%|          | 0.00/1.38M [00:00<?, ?B/s]
config.json:   0%|          | 0.00/760 [00:00<?, ?B/s]
pytorch_model.bin:   0%|          | 0.00/467M [00:00<?, ?B/s]
Some weights of XLNetForSequenceClassification were not initialized from the model checkpoint at xlnet-base-cased and are newly initialized: ['logits_proj.bias', 'logits_proj.weight', 'sequence_summary.summary.bias', 'sequence_summary.summary.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[430/430 02:23, Epoch 10/10]
Epoch Training Loss Validation Loss Accuracy Precision Recall F1
1 0.722600 0.761581 0.741176 0.549343 0.741176 0.631002
2 0.737200 0.733592 0.741176 0.549343 0.741176 0.631002
3 0.681000 0.745367 0.741176 0.549343 0.741176 0.631002
4 0.547000 0.833539 0.741176 0.549343 0.741176 0.631002
5 0.748700 0.844072 0.741176 0.555882 0.741176 0.635294
6 0.557600 0.827177 0.741176 0.645272 0.741176 0.675758
7 0.490700 0.999443 0.752941 0.613055 0.752941 0.674115
8 0.267400 1.041874 0.717647 0.558170 0.717647 0.627941
9 0.306800 1.019170 0.717647 0.656401 0.717647 0.685430
10 0.221500 1.149069 0.600000 0.656092 0.600000 0.619913

model.safetensors:   0%|          | 0.00/467M [00:00<?, ?B/s]
Finished training XLNet.

Train Classification Report for XLNet:
               precision    recall  f1-score     support
0              0.916667  0.733333  0.814815   30.000000
1              0.996047  0.996047  0.996047  253.000000
2              0.873016  0.964912  0.916667   57.000000
accuracy       0.967647  0.967647  0.967647    0.967647
macro avg      0.928577  0.898098  0.909176  340.000000
weighted avg   0.968417  0.967647  0.966748  340.000000
Test Classification Report for XLNet:
               precision    recall  f1-score    support
0              0.000000  0.000000  0.000000   8.000000
1              0.794118  0.857143  0.824427  63.000000
2              0.411765  0.500000  0.451613  14.000000
accuracy       0.717647  0.717647  0.717647   0.717647
macro avg      0.401961  0.452381  0.425347  85.000000
weighted avg   0.656401  0.717647  0.685430  85.000000
Training ALBERT...
tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]
spiece.model:   0%|          | 0.00/760k [00:00<?, ?B/s]
tokenizer.json:   0%|          | 0.00/1.31M [00:00<?, ?B/s]
config.json:   0%|          | 0.00/684 [00:00<?, ?B/s]
model.safetensors:   0%|          | 0.00/47.4M [00:00<?, ?B/s]
Some weights of AlbertForSequenceClassification were not initialized from the model checkpoint at albert-base-v2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[430/430 01:17, Epoch 10/10]
Epoch Training Loss Validation Loss Accuracy Precision Recall F1
1 0.735000 0.740522 0.741176 0.549343 0.741176 0.631002
2 0.704500 0.720767 0.741176 0.549343 0.741176 0.631002
3 0.658700 0.799649 0.741176 0.549343 0.741176 0.631002
4 0.641900 0.737687 0.741176 0.549343 0.741176 0.631002
5 0.861400 0.732460 0.741176 0.555882 0.741176 0.635294
6 0.690300 0.795152 0.741176 0.549343 0.741176 0.631002
7 0.673900 0.824409 0.741176 0.549343 0.741176 0.631002
8 0.608300 0.753014 0.705882 0.681437 0.705882 0.687696
9 0.484400 0.795192 0.682353 0.670353 0.682353 0.670154
10 0.564800 0.758030 0.647059 0.636548 0.647059 0.637781

Finished training ALBERT.

Train Classification Report for ALBERT:
               precision    recall  f1-score     support
0              0.666667  0.066667  0.121212   30.000000
1              0.931174  0.909091  0.920000  253.000000
2              0.488889  0.771930  0.598639   57.000000
accuracy       0.811765  0.811765  0.811765    0.811765
macro avg      0.695577  0.582562  0.546617  340.000000
weighted avg   0.833687  0.811765  0.795644  340.000000
Test Classification Report for ALBERT:
               precision    recall  f1-score    support
0              0.000000  0.000000  0.000000   8.000000
1              0.836066  0.809524  0.822581  63.000000
2              0.375000  0.642857  0.473684  14.000000
accuracy       0.705882  0.705882  0.705882   0.705882
macro avg      0.403689  0.484127  0.432088  85.000000
weighted avg   0.681437  0.705882  0.687696  85.000000

Summary Table - Ordered by Test Accuracy (Descending):
     Model  Train Accuracy  Test Accuracy  Train F1  Test F1
   RoBERTa        0.964706       0.788235  0.964093 0.767828
DistilBERT        0.947059       0.788235  0.945991 0.728466
     XLNet        0.967647       0.717647  0.966748 0.685430
    ALBERT        0.811765       0.705882  0.795644 0.687696
      BERT        0.985294       0.694118  0.985201 0.659328

Summary Table - Ordered by Test F1 Score (Descending):
     Model  Train Accuracy  Test Accuracy  Train F1  Test F1
   RoBERTa        0.964706       0.788235  0.964093 0.767828
DistilBERT        0.947059       0.788235  0.945991 0.728466
    ALBERT        0.811765       0.705882  0.795644 0.687696
     XLNet        0.967647       0.717647  0.966748 0.685430
      BERT        0.985294       0.694118  0.985201 0.659328
Insights:¶
  • RoBERTa is giving best results for both test accuracy (78.82%) and Test F1 score(76.78%).
  • After balancing the data Test accuracy reduced a bit from 80% to 78.82% but f1 score improved from 73% to 76.78%.
  • We are also getting decent precision, recall and f1 scores for minority classes and these scores are better than the DistilBERT model output using SMOTE balanced data.
  • This seems to be the best overall output till now.

Best Model:¶

  • We choose RoBERTa as the best model which is trained on the balanced data obtained from Random Over Sampling.
  • After balancing the data Test accuracy reduced a bit from 80% to 78.82% but f1 score improved from 73% to 76.78%. The f1 score is the best among all the models attempted.
  • This model also predicts the minority classes effectively compared to other models.
  • Prediction of minority class was the biggest issue we faced, which got resolved to a great extent by using this model.

Random Over Sampler performed better than SMOTE because of the following reasons:

  • Textual Nature of Data: SMOTE generates synthetic samples by interpolating between existing samples, which is challenging for textual data since interpolation doesn't preserve the semantic integrity of text. ROS, on the other hand, simply duplicates existing samples, avoiding semantic distortions.
  • Transformers Handle Redundancy Well: Transformer models like BERT or RoBERTa are robust to repeated instances during training. They can still learn meaningful patterns without overfitting due to the model's architecture and regularization techniques like dropout.
  • Preservation of Original Data Distribution: ROS preserves the original data distribution and ensures the generated samples are realistic, which is critical for text classification. SMOTE may introduce unrealistic or nonsensical text samples that can mislead the model during training.
  • Simpler Preprocessing: ROS doesn't require additional computational steps like creating synthetic embeddings or tokenized data, making it more straightforward to implement in a transformer-based pipeline.

Reason why RoBERTa is chosen as the best model and how it is better than other BERT like models:

  • RoBERTa was trained on significantly more data than BERT (160GB for RoBERTa vs. 16GB for BERT).
  • Access to more training data allows RoBERTa to better generalize patterns in natural language, making it more robust across various domains and also helps to avoid overfitting.
  • BERT uses static masking, where the same tokens are masked during every epoch of training.
  • RoBERTa employs dynamic masking, where tokens are masked differently for each epoch. This leads to a better understanding of the language structure and improves the model's learning capacity.
  • RoBERTa's pretraining helps it generalize across underrepresented classes, making it more suitable for imbalanced datasets where classes like "Medium" or "High" severity might have fewer samples.

The improvement in accuracy and F1 score after merging accident levels into three severity levels ("Low," "Medium," and "High") is primarily due to a combination of factors related to class distribution, simplification of the classification problem, and data representation:

  1. Simplification of the Classification Problem
  • Reduced Complexity: By reducing the classification from five to three classes, the model faces a simpler classification problem. Fewer classes mean the model has fewer distinctions to learn, making it easier to generalize patterns in the data.
  • Clearer Boundaries: When fewer classes are present, the decision boundaries for each class become more defined. This helps the model to classify more accurately because it no longer needs to distinguish between similar classes (e.g., levels II and III or IV and V).
  • Less Class Overlap: In many cases, classes that are merged (like levels II and III, or IV and V) have similar characteristics in the input data. This reduction can minimize overlapping features that the model might previously have confused, leading to more reliable predictions.
  1. Better Class Distribution
  • Balanced Representation: In multi-class classification, imbalanced classes can make training difficult for the model, as it tends to focus on the majority classes. By merging levels, you've effectively reduced some of the imbalance, giving the model a more balanced dataset, which can result in better performance.
  • Improved Sample Size per Class: Merging classes also increases the number of samples in each class. Larger sample sizes per class allow the model to learn more representative patterns for each severity level, enhancing generalization and accuracy.
  1. Enhanced Metrics Calculation
  • Accuracy and F1-Score Sensitivity: With fewer classes and a more balanced dataset, both accuracy and F1-score metrics typically become more stable and meaningful. F1 score, especially, is sensitive to class imbalances and benefits from the reduction in complexity and increase in per-class sample size, providing a better assessment of model performance.

How This Change Impacts Performance:

The merging of classes effectively addresses issues of class imbalance and improves the dataset’s representation across classes. As a result:

  • Reduced Overfitting: The model is less likely to overfit on minority classes due to the increased sample sizes and simplified classification task.
  • Improved Generalization: A simpler classification task and more balanced data distribution mean the model is more likely to generalize well to new data, reflected in improved performance metrics like accuracy and F1 score.

Limitations:

  • Loss of Granularity: Although performance improves, merging classes sacrifices some detail like the nuanced difference between adjacent accident levels(e.g., levels IV and V). This might potentially impact decision-making.
  • Computation Requirements: Transformer models are computationally expensive and require significant resources for inference. This can be a bottleneck in real-time applications, especially in resource-constrained environments.
  • Less and highly imbalanced data: The data is less and the available data for minority class is very less which makes it harder to train the model on the minority class and get correct predictions on the minority class.
  • Oversampling Bias: While Random Over Sampling balances class distribution, it might lead the model to overfit duplicated samples, especially for minority classes. This can reduce the model's ability to identify subtle variations in underrepresented classes.
  • Generalization to New Data: The model is fine-tuned on a specific dataset, and its ability to generalize to unseen real-world accident descriptions might be limited if the dataset doesn't represent the diversity of real-world cases.
  • Dependency on Data Quality: Transformer models like RoBERTa heavily depend on high-quality, well-annotated data. Noise or ambiguities in accident descriptions can negatively impact model performance.

Possible Solution Enhancements:

  • Data Augmentation: Instead of relying solely on Random Over Sampling, we can incorporate domain-specific data augmentation techniques to create diverse and realistic training samples.
  • Incorporate Domain Knowledge: We can use domain-specific embeddings or pre-trained models fine-tuned on industry-relevant corpora to improve understanding of accident-specific terminology.
  • Active Learning: We can iteratively refine the model using active learning by labeling high-uncertainty predictions. This helps adapt the model to evolving data distributions.
  • Explainability Tools: Integrate explainability techniques (e.g., SHAP or LIME) to interpret model predictions. This is critical for real-world acceptance where decisions must be explainable.
  • Testing on Diverse Real-World Data: We can test the model on real-world accident reports from various industries and geographies to ensure robustness. Retrain or fine-tune the model with additional data as needed.
  • Class-Imbalance Mitigation: We can consider advanced balancing techniques like dynamic sampling or cost-sensitive learning to reduce dependence on oversampling methods like ROS.
  • Resource Optimization: We can explore lightweight transformer models like DistilRoBERTa or quantization techniques for efficient inference in deployment environments.

By reducing the complexity of the target variable, we have optimized the model's learning process, resulting in a more reliable, accurate classification model.

Comparison to Benchmark:¶

  • RoBERTa model is giving a significant(around 5%) improvement of Test Accuracy(78.82%) over the Naive Bayes benchmark Test accuracy of 74% and around 14% improvement i.e 76.78% of Test F1 score over the Naive Bayes benchmark Test F1 score of 63%
  • Minority classes were not getting predicted in Naive Bayes but minority classes are getting predicted well using RoBERTa.
  • Minority classes prediction improvement is the biggest improvement we achieved with this model.

Reasons of Improvement:¶

  • The significant improvement in performance metrics and minority class predictions using the fine-tuned RoBERTa model on random over-sampled data for merged accident classes stems from RoBERTa's ability to capture complex linguistic patterns and context through pre-trained transformers, which Naive Bayes lacks as it relies on simplified word independence assumptions and lacks the ability to understand context.
  • Fine-tuning RoBERTa on the accident description dataset allowed the model to align its pre-trained linguistic knowledge with the domain-specific features, significantly enhancing its performance, especially on nuanced minority class predictions.
  • Merging accident classes reduced class overlap and improved class balance.
  • Oversampling ensured better representation of minority classes, enabling RoBERTa to generalize effectively across all categories. Naive Bayes, on the other hand, is more sensitive to imbalanced distributions and struggles with such adjustments.

Implications:¶

  • Improved Decision-Making in Safety Management: The model's ability to classify accident severity levels accurately enables safety teams to prioritize responses to high-severity incidents. This helps in faster allocation of resources, reducing downtime, and preventing escalation of critical issues.
  • Enhanced Incident Reporting Efficiency:

By automating the classification of accident descriptions, the solution reduces the manual effort required in categorizing incidents. This improves reporting consistency and ensures a standardized approach to assessing severity.

  • Focus on High-Risk Areas:

With reliable predictions, businesses can identify trends in high-severity incidents, helping to design more targeted interventions like better training, updated safety protocols, or investment in protective equipment.

  • Proactive Risk Management:

By analyzing prediction trends, businesses can proactively address safety concerns before incidents occur, fostering a safer work environment.

  • Regulatory Compliance:

Automated, accurate incident categorization ensures adherence to safety reporting standards, reducing compliance risks.

Recommendations:¶

  • Adopt RoBERTa for Deployment with Oversampling:

Based on the evaluation metrics and performance on minority class prediction, the RoBERTa model is the most suitable for deployment. Its ability to handle imbalanced data using Random Over Sampler ensures improved prediction accuracy for minority classes, critical for real-world applications.

  • Monitor Model in Real-World Scenarios:

Implement continuous monitoring of the model's performance after deployment. Metrics like precision, recall, and F1-score for each severity class should be tracked to detect any drift in performance.

  • Integrate Domain Expertise:

Use the model as a decision-support tool rather than a sole decision-maker. Incorporate feedback from safety experts to fine-tune the solution periodically, improving its relevance and reliability.

  • Scalability and Regular Updates:

Ensure the model can scale to handle larger datasets or new accident description formats as the organization grows. Update the model with new data periodically to maintain accuracy and relevance.

Confidence in Recommendations¶

  • Performance Metrics Confidence:

The chosen solution demonstrates strong performance metrics score(around 80%), including high accuracy and F1-score on test data, indicating high confidence in the model's ability to generalize effectively.

  • Business Context Alignment:

The ability to predict minority classes accurately ensures that even less frequent but critical high-severity incidents are identified, aligning with business priorities in safety management.

  • Caveats:

While the model shows promise, it is essential to recognize potential limitations such as reliance on historical data quality, sensitivity to distribution shifts, and interpretability challenges in transformer models.

Closing Reflections¶

The journey of predicting accident severity levels based on textual accident descriptions has been a comprehensive exploration of various machine learning and natural language processing techniques. Started with traditional classifiers along with NLP pre-precessed data in 1st milestone, then in 2nd milestone started with traditional classifiers using BERT embeddings and evolving to fine-tuning transformer models for sequence classification, this process has provided valuable insights into the nuances of model selection, data preprocessing, and handling class imbalances.

Key Learnings:

  • Impact of Preprocessing:

Initial attempts revealed that traditional classifiers like Logistic Regression, SVM,Random Forest and Gradient Boosting performed better with raw accident descriptions rather than preprocessed text. This highlighted the importance of retaining contextual information in text for accurate representation when using embeddings like BERT.

  • Advantages of Fine-Tuning Transformers:

Transitioning to fine-tuned transformer models like BERT, RoBERTa, DistilBERT, XLNet, and ALBERT demonstrated their superior ability to capture intricate language patterns. This approach provided better overall accuracy and robustness compared to static embeddings with traditional classifiers.

  • Handling Overfitting:

Experiments with varying epochs revealed that increasing epochs led to overfitting, as seen in the widening gap between train and test metrics. Setting metric_for_best_model to "F1" and evaluating at every epoch provided a better mechanism to prevent overfitting and select the best model for generalization.

  • Challenges with Imbalanced Data:

Techniques like Random Over Sampling and SMOTE initially posed challenges in improving minority class predictions, especially with transformers. However, merging the accident levels into three broader categories and applying random oversampling significantly improved model performance on minority classes.

  • RoBERTa’s Effectiveness:

Among all tested models, RoBERTa stood out due to its robustness, superior generalization on test data, and effectiveness in predicting minority classes after merging categories. This demonstrated the importance of model architecture and pre-training corpus in domain-specific tasks.

What Would Be Done Differently Next Time?

  • Earlier Focus on Domain-Specific Insights:

The merging of accident levels into broader categories proved pivotal in achieving better performance. Next time, starting with a domain-informed analysis of class definitions and their real-world implications might streamline the process and improve results earlier.

  • Targeted Class Balancing Approaches:

While Random Over Sampling worked well eventually, exploring transformer-specific data augmentation methods or other advanced techniques like Class-Balanced Loss or Focal Loss from the outset could improve minority class predictions without oversampling.

  • Experimentation with Custom Tokenizers:

Using pre-trained tokenizers was effective, but experimenting with custom tokenizers trained on domain-specific accident descriptions might provide even better results by aligning the model vocabulary with the dataset.

  • Greater Automation of Hyperparameter Tuning:

A more structured approach using tools like Optuna or Hyperopt could automate hyperparameter tuning and potentially uncover better configurations faster than manual experimentation.

Final Reflection

This project has highlighted the complexities and rewards of applying state-of-the-art NLP models to real-world classification problems. The iterative process of model optimization, data adjustments, and evaluation provided deep insights into handling imbalanced data, avoiding overfitting, and achieving both strong overall metrics and effective minority class predictions. By combining domain-specific adjustments with advanced machine learning techniques, the solution demonstrates how modern AI can enhance safety management processes and enable more effective decision-making. The experience gained here serves as a strong foundation for tackling similar classification tasks in other domains, paving the way for continuous improvement and innovation.

References:¶

https://discuss.ai.google.dev/t/does-text-preprocessing-or-cleaning-required-for-bert-model-or-others-one/29416

https://towardsdatascience.com/part-1-data-cleaning-does-bert-need-clean-data-6a50c9c6e9fd

https://huggingface.co/google-bert/bert-base-uncased

https://www.linkedin.com/advice/1/what-most-effective-text-classification-algorithms

https://www.quora.com/Whats-the-best-Machine-Learning-algorithm-to-use-for-text-classification-if-you-use-tf-idf-and-word-embeddings-as-your-text-features

https://www.kaggle.com/datasets/ihmstefanini/industrial-safety-and-health-analytics-database

https://huggingface.co/docs/transformers/en/model_doc/roberta